linux-mm.kvack.org archive mirror
* [PATCH v1 00/14] mm: BPF OOM
@ 2025-08-18 17:01 Roman Gushchin
  2025-08-18 17:01 ` [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling Roman Gushchin
                   ` (15 more replies)
  0 siblings, 16 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

This patchset adds the ability to customize the out of memory
handling using bpf.

It focuses on two parts:
1) OOM handling policy,
2) PSI-based OOM invocation.

The idea of using bpf for customizing OOM handling is not new, but
unlike the previous proposal [1], which augmented the existing task
ranking policy, this one tries to be as generic as possible and
leverage the full power of modern bpf.

It provides a generic interface which is called before the existing
OOM killer code and allows implementing any policy, e.g. picking a
victim task or memory cgroup, or potentially even releasing memory
in other ways, e.g. by deleting tmpfs files (the last one might
require some additional but relatively simple changes).
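
For illustration, the overall shape of such a policy looks roughly as
follows (a sketch only; the struct_ops interface and kfuncs are
introduced in the patches below, and "my_oom_policy" is a placeholder
name):

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  SEC("struct_ops.s/handle_out_of_memory")
  int BPF_PROG(my_oom_policy, struct oom_control *oc)
  {
  	/*
  	 * Pick a victim, free some memory (e.g. via
  	 * bpf_oom_kill_process()) and return 1; returning 0 falls
  	 * back to the in-kernel OOM killer.
  	 */
  	return 0;
  }

  SEC(".struct_ops.link")
  struct bpf_oom_ops my_oom_ops = {
  	.name = "my_oom_policy",
  	.handle_out_of_memory = (void *)my_oom_policy,
  };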

The past attempt to implement a memory-cgroup aware policy [2] showed
that there are multiple opinions on what the best policy is.  As it's
highly workload-dependent and specific to a concrete way of organizing
workloads, the structure of the cgroup tree etc., a customizable
bpf-based implementation is preferable to an in-kernel implementation
with a dozen sysctls.

The second part is related to the fundamental question of when to
declare the OOM event. It's a trade-off between the risk of
unnecessary OOM kills and associated work losses and the risk of
infinite thrashing and effective soft lockups.  In the last few years
several PSI-based userspace solutions were developed (e.g. OOMd [3] or
systemd-OOMd [4]). The common idea was to use userspace daemons to
implement custom OOM logic as well as rely on PSI monitoring to avoid
stalls. In this scenario the userspace daemon was supposed to handle
the majority of OOMs, while the in-kernel OOM killer worked as the
last resort measure to guarantee that the system would never deadlock
on memory. But this approach creates additional infrastructure churn:
a userspace OOM daemon is a separate entity which needs to be
deployed, updated and monitored. A completely different pipeline needs
to be built to monitor both types of OOM events and collect associated
logs. A userspace daemon is more restricted in terms of what data is
available to it. Implementing a daemon which can work reliably under
heavy memory pressure is also tricky.

[1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
[2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
[3]: https://github.com/facebookincubator/oomd
[4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html

----

v1:
  1) Both OOM and PSI parts are now implemented using bpf struct ops,
     providing a path for future extensions (suggested by Kumar Kartikeya Dwivedi,
     Song Liu and Matt Bobrowski)
  2) It's possible to create PSI triggers from BPF, no need for an additional
     userspace agent. (suggested by Suren Baghdasaryan)
     Also there is now a callback for the cgroup release event.
  3) Added an ability to block on oom_lock instead of bailing out (suggested by Michal Hocko)
  4) Added bpf_task_is_oom_victim (suggested by Michal Hocko)
  5) PSI callbacks are scheduled using a separate workqueue (suggested by Suren Baghdasaryan)

RFC:
  https://lwn.net/ml/all/20250428033617.3797686-1-roman.gushchin@linux.dev/


Roman Gushchin (14):
  mm: introduce bpf struct ops for OOM handling
  bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  mm: introduce bpf_oom_kill_process() bpf kfunc
  mm: introduce bpf kfuncs to deal with memcg pointers
  mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  mm: introduce bpf_out_of_memory() bpf kfunc
  mm: allow specifying custom oom constraint for bpf triggers
  mm: introduce bpf_task_is_oom_victim() kfunc
  bpf: selftests: introduce read_cgroup_file() helper
  bpf: selftests: bpf OOM handler test
  sched: psi: refactor psi_trigger_create()
  sched: psi: implement psi trigger handling using bpf
  sched: psi: implement bpf_psi_create_trigger() kfunc
  bpf: selftests: psi struct ops test

 include/linux/bpf_oom.h                       |  49 +++
 include/linux/bpf_psi.h                       |  71 ++++
 include/linux/memcontrol.h                    |   2 +
 include/linux/oom.h                           |  12 +
 include/linux/psi.h                           |  15 +-
 include/linux/psi_types.h                     |  72 +++-
 kernel/bpf/verifier.c                         |   5 +
 kernel/cgroup/cgroup.c                        |  14 +-
 kernel/sched/bpf_psi.c                        | 337 ++++++++++++++++++
 kernel/sched/build_utility.c                  |   4 +
 kernel/sched/psi.c                            | 130 +++++--
 mm/Makefile                                   |   4 +
 mm/bpf_memcontrol.c                           | 166 +++++++++
 mm/bpf_oom.c                                  | 157 ++++++++
 mm/oom_kill.c                                 | 182 +++++++++-
 tools/testing/selftests/bpf/cgroup_helpers.c  |  39 ++
 tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
 .../selftests/bpf/prog_tests/test_oom.c       | 229 ++++++++++++
 .../selftests/bpf/prog_tests/test_psi.c       | 224 ++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c  | 108 ++++++
 tools/testing/selftests/bpf/progs/test_psi.c  |  76 ++++
 21 files changed, 1845 insertions(+), 53 deletions(-)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 include/linux/bpf_psi.h
 create mode 100644 kernel/sched/bpf_psi.c
 create mode 100644 mm/bpf_memcontrol.c
 create mode 100644 mm/bpf_oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

-- 
2.50.1




* [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-19  4:09   ` Suren Baghdasaryan
                     ` (2 more replies)
  2025-08-18 17:01 ` [PATCH v1 02/14] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
                   ` (14 subsequent siblings)
  15 siblings, 3 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Introduce a bpf struct ops for implementing custom OOM handling policies.

The struct ops provides the bpf_handle_out_of_memory() callback,
which is expected to return 1 if it was able to free some memory and 0
otherwise.

In the latter case it's guaranteed that the in-kernel OOM killer will
be invoked. Otherwise the kernel also checks the bpf_memory_freed
field of the oom_control structure, which is expected to be set by
kfuncs suitable for releasing memory. It's a safety mechanism which
prevents a bpf program from claiming forward progress without actually
releasing memory. The callback program is sleepable to enable using
iterators, e.g. cgroup iterators.

The callback receives struct oom_control as an argument, so it can
easily filter out OOMs it doesn't want to handle, e.g. global vs.
memcg OOMs.

The callback is executed just before the kernel victim task selection
algorithm, so all heuristics and sysctls like panic on oom and
sysctl_oom_kill_allocating_task are respected.

The struct ops also has the name field, which allows defining a
custom name for the implemented policy. It's printed in the OOM report
in the oom_policy=<policy> format. "default" is printed if bpf is not
used or the policy name is not specified.

[  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
               oom_policy=bpf_test_policy
[  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[  112.698167] Call Trace:
[  112.698177]  <TASK>
[  112.698182]  dump_stack_lvl+0x4d/0x70
[  112.698192]  dump_header+0x59/0x1c6
[  112.698199]  oom_kill_process.cold+0x8/0xef
[  112.698206]  bpf_oom_kill_process+0x59/0xb0
[  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
[  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
[  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
[  112.698240]  bpf_handle_oom+0x11a/0x1e0
[  112.698250]  out_of_memory+0xab/0x5c0
[  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
[  112.698274]  try_charge_memcg+0x4b5/0x7e0
[  112.698288]  charge_memcg+0x2f/0xc0
[  112.698293]  __mem_cgroup_charge+0x30/0xc0
[  112.698299]  do_anonymous_page+0x40f/0xa50
[  112.698311]  __handle_mm_fault+0xbba/0x1140
[  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
[  112.698335]  handle_mm_fault+0xe6/0x370
[  112.698343]  do_user_addr_fault+0x211/0x6a0
[  112.698354]  exc_page_fault+0x75/0x1d0
[  112.698363]  asm_exc_page_fault+0x26/0x30
[  112.698366] RIP: 0033:0x7fa97236db00

It's possible to load multiple bpf struct ops programs. In the case
of an OOM, they will be executed one by one in the same order they
have been loaded, until one of them returns 1 and bpf_memory_freed is
set to 1 - an indication that the memory was freed. This allows having
multiple bpf programs focused on different types of OOMs - e.g. one
program can handle only memcg OOMs in one memory cgroup.
But the filtering is done in bpf, so it's fully flexible.
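
For example, a handler which only cares about memcg OOMs can bail out
early for global ones (a sketch; a complete working policy is added as
a selftest later in the series, and the function name is a
placeholder):

  SEC("struct_ops.s/handle_out_of_memory")
  int BPF_PROG(handle_memcg_ooms, struct oom_control *oc)
  {
  	/* oc->memcg is NULL for global OOMs */
  	if (!oc->memcg)
  		return 0;	/* let other handlers or the kernel act */

  	/*
  	 * ... pick a victim, free memory (which sets
  	 * oc->bpf_memory_freed via a kfunc) and return 1 ...
  	 */
  	return 0;
  }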

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/bpf_oom.h |  49 +++++++++++++
 include/linux/oom.h     |   8 ++
 mm/Makefile             |   3 +
 mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c           |  22 +++++-
 5 files changed, 237 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 mm/bpf_oom.c

diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
new file mode 100644
index 000000000000..29cb5ea41d97
--- /dev/null
+++ b/include/linux/bpf_oom.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_OOM_H
+#define __BPF_OOM_H
+
+struct bpf_oom;
+struct oom_control;
+
+#define BPF_OOM_NAME_MAX_LEN 64
+
+struct bpf_oom_ops {
+	/**
+	 * @handle_out_of_memory: Out of memory bpf handler, called before
+	 * the in-kernel OOM killer.
+	 * @oc: OOM control structure
+	 *
+	 * Should return 1 if some memory was freed up, otherwise
+	 * the in-kernel OOM killer is invoked.
+	 */
+	int (*handle_out_of_memory)(struct oom_control *oc);
+
+	/**
+	 * @name: BPF OOM policy name
+	 */
+	char name[BPF_OOM_NAME_MAX_LEN];
+
+	/* Private */
+	struct bpf_oom *bpf_oom;
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+/**
+ * bpf_handle_oom - handle out of memory using bpf programs
+ * @oc: OOM control structure
+ *
+ * Returns true if a bpf oom program was executed, returned 1
+ * and some memory was actually freed.
+ */
+bool bpf_handle_oom(struct oom_control *oc);
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline bool bpf_handle_oom(struct oom_control *oc)
+{
+	return false;
+}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_OOM_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1e0fc6931ce9..ef453309b7ea 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -51,6 +51,14 @@ struct oom_control {
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Used by the bpf oom implementation to mark the forward progress */
+	bool bpf_memory_freed;
+
+	/* Policy name */
+	const char *bpf_policy_name;
+#endif
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/Makefile b/mm/Makefile
index 1a7a11d4933d..a714aba03759 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
+ifdef CONFIG_BPF_SYSCALL
+obj-y += bpf_oom.o
+endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
 obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
new file mode 100644
index 000000000000..47633046819c
--- /dev/null
+++ b/mm/bpf_oom.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF-driven OOM killer customization
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/bpf.h>
+#include <linux/oom.h>
+#include <linux/bpf_oom.h>
+#include <linux/srcu.h>
+
+DEFINE_STATIC_SRCU(bpf_oom_srcu);
+static DEFINE_SPINLOCK(bpf_oom_lock);
+static LIST_HEAD(bpf_oom_handlers);
+
+struct bpf_oom {
+	struct bpf_oom_ops *ops;
+	struct list_head node;
+	struct srcu_struct srcu;
+};
+
+bool bpf_handle_oom(struct oom_control *oc)
+{
+	struct bpf_oom_ops *ops;
+	struct bpf_oom *bpf_oom;
+	int list_idx, idx, ret = 0;
+
+	oc->bpf_memory_freed = false;
+
+	list_idx = srcu_read_lock(&bpf_oom_srcu);
+	list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
+		ops = READ_ONCE(bpf_oom->ops);
+		if (!ops || !ops->handle_out_of_memory)
+			continue;
+		idx = srcu_read_lock(&bpf_oom->srcu);
+		oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
+			"bpf_defined_policy";
+		ret = ops->handle_out_of_memory(oc);
+		oc->bpf_policy_name = NULL;
+		srcu_read_unlock(&bpf_oom->srcu, idx);
+
+		if (ret && oc->bpf_memory_freed)
+			break;
+	}
+	srcu_read_unlock(&bpf_oom_srcu, list_idx);
+
+	return ret && oc->bpf_memory_freed;
+}
+
+static int __handle_out_of_memory(struct oom_control *oc)
+{
+	return 0;
+}
+
+static struct bpf_oom_ops __bpf_oom_ops = {
+	.handle_out_of_memory = __handle_out_of_memory,
+};
+
+static const struct bpf_func_proto *
+bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_oom_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
+	.get_func_proto = bpf_oom_func_proto,
+	.is_valid_access = bpf_oom_ops_is_valid_access,
+};
+
+static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_oom_ops *ops = kdata;
+	struct bpf_oom *bpf_oom;
+	int ret;
+
+	bpf_oom = kmalloc(sizeof(*bpf_oom), GFP_KERNEL_ACCOUNT);
+	if (!bpf_oom)
+		return -ENOMEM;
+
+	ret = init_srcu_struct(&bpf_oom->srcu);
+	if (ret) {
+		kfree(bpf_oom);
+		return ret;
+	}
+
+	WRITE_ONCE(bpf_oom->ops, ops);
+	ops->bpf_oom = bpf_oom;
+
+	spin_lock(&bpf_oom_lock);
+	list_add_rcu(&bpf_oom->node, &bpf_oom_handlers);
+	spin_unlock(&bpf_oom_lock);
+
+	return 0;
+}
+
+static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_oom_ops *ops = kdata;
+	struct bpf_oom *bpf_oom = ops->bpf_oom;
+
+	WRITE_ONCE(bpf_oom->ops, NULL);
+
+	spin_lock(&bpf_oom_lock);
+	list_del_rcu(&bpf_oom->node);
+	spin_unlock(&bpf_oom_lock);
+
+	synchronize_srcu(&bpf_oom->srcu);
+
+	kfree(bpf_oom);
+}
+
+static int bpf_oom_ops_init_member(const struct btf_type *t,
+				   const struct btf_member *member,
+				   void *kdata, const void *udata)
+{
+	const struct bpf_oom_ops *uops = (const struct bpf_oom_ops *)udata;
+	struct bpf_oom_ops *ops = (struct bpf_oom_ops *)kdata;
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_oom_ops, name):
+		strscpy_pad(ops->name, uops->name, sizeof(ops->name));
+		return 1;
+	}
+	return 0;
+}
+
+static int bpf_oom_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static struct bpf_struct_ops bpf_oom_bpf_ops = {
+	.verifier_ops = &bpf_oom_verifier_ops,
+	.reg = bpf_oom_ops_reg,
+	.unreg = bpf_oom_ops_unreg,
+	.init_member = bpf_oom_ops_init_member,
+	.init = bpf_oom_ops_init,
+	.name = "bpf_oom_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &__bpf_oom_ops
+};
+
+static int __init bpf_oom_struct_ops_init(void)
+{
+	return register_bpf_struct_ops(&bpf_oom_bpf_ops, bpf_oom_ops);
+}
+late_initcall(bpf_oom_struct_ops_init);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..ad7bd65061d6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -45,6 +45,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/cred.h>
 #include <linux/nmi.h>
+#include <linux/bpf_oom.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -246,6 +247,15 @@ static const char * const oom_constraint_text[] = {
 	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
 };
 
+static const char *oom_policy_name(struct oom_control *oc)
+{
+#ifdef CONFIG_BPF_SYSCALL
+	if (oc->bpf_policy_name)
+		return oc->bpf_policy_name;
+#endif
+	return "default";
+}
+
 /*
  * Determine the type of allocation constraint.
  */
@@ -458,9 +468,10 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
 
 static void dump_header(struct oom_control *oc)
 {
-	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
+	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\noom_policy=%s\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
-			current->signal->oom_score_adj);
+		current->signal->oom_score_adj,
+		oom_policy_name(oc));
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
@@ -1161,6 +1172,13 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/*
+	 * Let bpf handle the OOM first. If it was able to free up some memory,
+	 * bail out. Otherwise fall back to the kernel OOM killer.
+	 */
+	if (bpf_handle_oom(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
-- 
2.50.1




* [PATCH v1 02/14] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
  2025-08-18 17:01 ` [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-20  9:17   ` Kumar Kartikeya Dwivedi
  2025-08-18 17:01 ` [PATCH v1 03/14] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted or NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which for example is required for iterating the memcg's subtree.
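
In practice this means a handler has to NULL-check the field before
using it, roughly like this (a sketch of a handler fragment;
bpf_get_mem_cgroup() and bpf_get_root_mem_cgroup() are introduced
later in this series):

  	struct mem_cgroup *memcg;

  	if (oc->memcg)		/* memcg-scoped OOM: trusted pointer */
  		memcg = bpf_get_mem_cgroup(&oc->memcg->css);
  	else			/* global OOM */
  		memcg = bpf_get_root_mem_cgroup();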

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/bpf/verifier.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 169845710c7e..b5153c843028 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7035,6 +7035,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
 	struct sock *sk;
 };
 
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control) {
+	struct mem_cgroup *memcg;
+};
+
 static bool type_is_rcu(struct bpf_verifier_env *env,
 			struct bpf_reg_state *reg,
 			const char *field_name, u32 btf_id)
@@ -7075,6 +7079,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
 {
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
 					  "__safe_trusted_or_null");
-- 
2.50.1




* [PATCH v1 03/14] mm: introduce bpf_oom_kill_process() bpf kfunc
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
  2025-08-18 17:01 ` [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling Roman Gushchin
  2025-08-18 17:01 ` [PATCH v1 02/14] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-18 17:01 ` [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers Roman Gushchin
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Introduce the bpf_oom_kill_process() bpf kfunc, which is supposed
to be used by bpf OOM programs. It allows killing a process
in exactly the same way the OOM killer does: using the OOM reaper,
bumping the corresponding memcg and global statistics, respecting
memory.oom.group etc.

On success, it sets oom_control's bpf_memory_freed field to true,
enabling the bpf program to bypass the kernel OOM killer.
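
A handler would typically use it along these lines (a sketch of a
handler fragment, mirroring the selftest added later in this series;
"victim" is an already-selected memcg and "bpf oom policy" is a
placeholder message):

  	struct task_struct *task;

  	bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
  		struct task_struct *t = bpf_task_acquire(task);

  		if (t) {
  			/* dumps the OOM report, sets oc->bpf_memory_freed */
  			bpf_oom_kill_process(oc, task, "bpf oom policy");
  			bpf_task_release(t);
  		}
  	}
  	return 1;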

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ad7bd65061d6..25fc5e744e27 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1282,3 +1282,70 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	return -ENOSYS;
 #endif /* CONFIG_MMU */
 }
+
+#ifdef CONFIG_BPF_SYSCALL
+
+__bpf_kfunc_start_defs();
+/**
+ * bpf_oom_kill_process - Kill a process as OOM killer
+ * @oc: pointer to oom_control structure, describes OOM context
+ * @task: task to be killed
+ * @message__str: message to print in dmesg
+ *
+ * Kill a process in a way similar to the kernel OOM killer.
+ * This means dump the necessary information to dmesg, adjust memcg
+ * statistics, leverage the oom reaper, respect memory.oom.group etc.
+ *
+ * bpf_oom_kill_process() marks the forward progress by setting
+ * oc->bpf_memory_freed. If the progress was made, the bpf program
+ * is free to decide if the kernel oom killer should be invoked.
+ * Otherwise it's enforced, so that a bad bpf program can't
+ * deadlock the machine on memory.
+ */
+__bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
+				     struct task_struct *task,
+				     const char *message__str)
+{
+	if (oom_unkillable_task(task))
+		return -EPERM;
+
+	/* paired with put_task_struct() in oom_kill_process() */
+	task = tryget_task_struct(task);
+	if (!task)
+		return -EINVAL;
+
+	oc->chosen = task;
+
+	oom_kill_process(oc, message__str);
+
+	oc->chosen = NULL;
+	oc->bpf_memory_freed = true;
+
+	return 0;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_oom_kfuncs)
+BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_oom_kfuncs)
+
+static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_oom_kfuncs,
+};
+
+static int __init bpf_oom_init(void)
+{
+	int err;
+
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					&bpf_oom_kfunc_set);
+	if (err)
+		pr_warn("error while registering bpf oom kfuncs: %d", err);
+
+	return err;
+}
+late_initcall(bpf_oom_init);
+
+#endif
-- 
2.50.1




* [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (2 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 03/14] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-20  9:21   ` Kumar Kartikeya Dwivedi
  2025-08-18 17:01 ` [PATCH v1 05/14] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Roman Gushchin
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

To effectively operate with memory cgroups in bpf there is a need
to convert css pointers to memcg pointers. A simple container_of
cast which is used in the kernel code can't be used in bpf because
from the verifier's point of view that's an out-of-bounds memory access.

Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from a css pointer:
  - bpf_get_mem_cgroup,
  - bpf_put_mem_cgroup.

bpf_get_mem_cgroup() can take both a memcg's css and the corresponding
cgroup's "self" css. This allows it to be used with the existing cgroup
iterator, which iterates over the cgroup tree, not the memcg tree.
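
Combined with the cgroup iterator this allows walking a memcg subtree,
e.g. (a sketch of a handler fragment, based on the selftest added
later in this series):

  	struct cgroup_subsys_state *pos;

  	bpf_rcu_read_lock();
  	bpf_for_each(css, pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
  		struct mem_cgroup *memcg = bpf_get_mem_cgroup(pos);

  		if (!memcg)
  			continue;

  		/* e.g. compare bpf_mem_cgroup_usage(memcg) across cgroups */

  		bpf_put_mem_cgroup(memcg);
  	}
  	bpf_rcu_read_unlock();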

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/memcontrol.h |   2 +
 mm/Makefile                |   1 +
 mm/bpf_memcontrol.c        | 151 +++++++++++++++++++++++++++++++++++++
 3 files changed, 154 insertions(+)
 create mode 100644 mm/bpf_memcontrol.c

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..785a064000cd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -932,6 +932,8 @@ static inline void mod_memcg_page_state(struct page *page,
 	rcu_read_unlock();
 }
 
+unsigned long memcg_events(struct mem_cgroup *memcg, int event);
+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
 unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
 unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
 unsigned long lruvec_page_state_local(struct lruvec *lruvec,
diff --git a/mm/Makefile b/mm/Makefile
index a714aba03759..c397af904a87 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
 ifdef CONFIG_BPF_SYSCALL
 obj-y += bpf_oom.o
+obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
 endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
new file mode 100644
index 000000000000..66f2a359af7e
--- /dev/null
+++ b/mm/bpf_memcontrol.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Memory Controller-related BPF kfuncs and auxiliary code
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/bpf.h>
+
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_get_mem_cgroup - Get a reference to a memory cgroup
+ * @css: pointer to the css structure
+ *
+ * Returns a pointer to a mem_cgroup structure after bumping
+ * the corresponding css's reference counter.
+ *
+ * It's fine to pass a css which belongs to any cgroup controller,
+ * e.g. unified hierarchy's main css.
+ *
+ * Implements KF_ACQUIRE semantics.
+ */
+__bpf_kfunc struct mem_cgroup *
+bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
+{
+	struct mem_cgroup *memcg = NULL;
+	bool rcu_unlock = false;
+
+	if (!root_mem_cgroup)
+		return NULL;
+
+	if (root_mem_cgroup->css.ss != css->ss) {
+		struct cgroup *cgroup = css->cgroup;
+		int ssid = root_mem_cgroup->css.ss->id;
+
+		rcu_read_lock();
+		rcu_unlock = true;
+		css = rcu_dereference_raw(cgroup->subsys[ssid]);
+	}
+
+	if (css && css_tryget(css))
+		memcg = container_of(css, struct mem_cgroup, css);
+
+	if (rcu_unlock)
+		rcu_read_unlock();
+
+	return memcg;
+}
+
+/**
+ * bpf_put_mem_cgroup - Put a reference to a memory cgroup
+ * @memcg: memory cgroup to release
+ *
+ * Releases a previously acquired memcg reference.
+ * Implements KF_RELEASE semantics.
+ */
+__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
+/**
+ * bpf_mem_cgroup_events - Read memory cgroup's event counter
+ * @memcg: memory cgroup
+ * @event: event idx
+ *
+ * Allows to read memory cgroup event counters.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_events(struct mem_cgroup *memcg, int event)
+{
+
+	if (event < 0 || event >= NR_VM_EVENT_ITEMS)
+		return (unsigned long)-1;
+
+	return memcg_events(memcg, event);
+}
+
+/**
+ * bpf_mem_cgroup_usage - Read memory cgroup's usage
+ * @memcg: memory cgroup
+ *
+ * Returns current memory cgroup size in bytes.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
+{
+	return page_counter_read(&memcg->memory);
+}
+
+/**
+ * bpf_mem_cgroup_page_state - Read memory cgroup's page state counter
+ * @memcg: memory cgroup
+ * @idx: page state item index
+ *
+ * Allows to read memory cgroup statistics.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
+{
+	if (idx < 0 || idx >= MEMCG_NR_STAT)
+		return (unsigned long)-1;
+
+	return memcg_page_state(memcg, idx);
+}
+
+/**
+ * bpf_mem_cgroup_flush_stats - Flush memory cgroup's statistics
+ * @memcg: memory cgroup
+ *
+ * Propagate memory cgroup's statistics up the cgroup tree.
+ *
+ * Note, that this function uses the rate-limited version of
+ * mem_cgroup_flush_stats() to avoid hurting the system-wide
+ * performance. So bpf_mem_cgroup_flush_stats() guarantees only
+ * that statistics is not stale beyond 2*FLUSH_TIME.
+ */
+__bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
+{
+	mem_cgroup_flush_stats_ratelimited(memcg);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
+BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+
+BTF_ID_FLAGS(func, bpf_mem_cgroup_events, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_usage, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_TRUSTED_ARGS)
+
+BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
+
+static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_memcontrol_kfuncs,
+};
+
+static int __init bpf_memcontrol_init(void)
+{
+	int err;
+
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					&bpf_memcontrol_kfunc_set);
+	if (err)
+		pr_warn("error while registering bpf memcontrol kfuncs: %d", err);
+
+	return err;
+}
+late_initcall(bpf_memcontrol_init);
-- 
2.50.1




* [PATCH v1 05/14] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (3 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-20  9:25   ` Kumar Kartikeya Dwivedi
  2025-08-18 17:01 ` [PATCH v1 06/14] mm: introduce bpf_out_of_memory() " Roman Gushchin
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Introduce a bpf kfunc to get a trusted pointer to the root memory
cgroup. It's very handy to traverse the full memcg tree, e.g.
for handling a system-wide OOM.

It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but it's sub-optimal and makes bpf programs
more complex and less efficient.

bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics,
however in reality it's not necessary to bump the corresponding
reference counter - the root memory cgroup is immortal and reference
counting is skipped, see css_get(). Once set, root_mem_cgroup is
always a valid memcg pointer. It's safe to call bpf_put_mem_cgroup()
for the pointer obtained with bpf_get_root_mem_cgroup(); it's
effectively a no-op.
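
Usage is straightforward (a sketch of a handler fragment):

  	struct mem_cgroup *root_memcg = bpf_get_root_mem_cgroup();

  	if (!root_memcg)
  		return 0;

  	/* ... walk the tree starting from root_memcg ... */

  	bpf_put_mem_cgroup(root_memcg);	/* effectively a no-op here */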

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/bpf_memcontrol.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 66f2a359af7e..a8faa561bcba 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -10,6 +10,20 @@
 
 __bpf_kfunc_start_defs();
 
+/**
+ * bpf_get_root_mem_cgroup - Returns a pointer to the root memory cgroup
+ *
+ * The function has KF_ACQUIRE semantics, even though the root memory
+ * cgroup is never destroyed after being created and doesn't require
+ * reference counting. And it's perfectly safe to pass it to
+ * bpf_put_mem_cgroup()
+ */
+__bpf_kfunc struct mem_cgroup *bpf_get_root_mem_cgroup(void)
+{
+	/* css_get() is not needed */
+	return root_mem_cgroup;
+}
+
 /**
  * bpf_get_mem_cgroup - Get a reference to a memory cgroup
  * @css: pointer to the css structure
@@ -122,6 +136,7 @@ __bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
+BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
 
-- 
2.50.1




* [PATCH v1 06/14] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (4 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 05/14] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-19  4:09   ` Suren Baghdasaryan
  2025-08-20  9:34   ` Kumar Kartikeya Dwivedi
  2025-08-18 17:01 ` [PATCH v1 07/14] mm: allow specifying custom oom constraint for bpf triggers Roman Gushchin
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Introduce the bpf_out_of_memory() bpf kfunc, which allows declaring
an out of memory event and triggering the corresponding kernel OOM
handling mechanism.

It takes a trusted memcg pointer (or NULL for system-wide OOMs)
as an argument, as well as the page order.

If the wait_on_oom_lock argument is not set, only one OOM can be
declared and handled in the system at once, so if the function is
called in parallel with another OOM handling, it bails out with -EBUSY.
This mode is suited for global OOMs: any concurrent OOM will likely
do the job and release some memory. In the blocking mode (which is
suited for memcg OOMs) the execution will wait on the oom_lock mutex.

The function is declared as sleepable, which guarantees that it won't
be called from an atomic context. This is required by the OOM handling
code, which is not guaranteed to work in a non-blocking context.

Handling of a memcg OOM almost always requires taking the
css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable
also guarantees that it can't be called with css_set_lock acquired,
so the kernel can't deadlock on it.
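
A caller (e.g. a PSI event handler from the later patches in this
series) would use it roughly as follows (a sketch of a fragment; note
that a later patch extends the signature with a constraint name):

  	/* declare a memcg OOM, blocking on oom_lock if it's contended */
  	ret = bpf_out_of_memory(memcg, 0, true);
  	if (ret < 0)
  		return ret;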

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25fc5e744e27..df409f0fac45 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1324,10 +1324,55 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
 	return 0;
 }
 
+/**
+ * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
+ * @memcg__nullable: memcg or NULL for system-wide OOMs
+ * @order: order of page which wasn't allocated
+ * @wait_on_oom_lock: if true, block on oom_lock
+ * @constraint_text__nullable: custom constraint description for the OOM report
+ *
+ * Declares the Out Of Memory state and invokes the OOM killer.
+ *
+ * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
+ * is true, the function will wait on it. Otherwise it bails out with -EBUSY
+ * if oom_lock is contended.
+ *
+ * Generally it's advised to pass wait_on_oom_lock=false for global OOMs
+ * and wait_on_oom_lock=true for memcg-scoped OOMs.
+ *
+ * Returns 1 if forward progress was achieved and some memory was freed.
+ * Returns a negative value if an error occurred.
+ */
+__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
+				  int order, bool wait_on_oom_lock)
+{
+	struct oom_control oc = {
+		.memcg = memcg__nullable,
+		.order = order,
+	};
+	int ret;
+
+	if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
+		return -EINVAL;
+
+	if (wait_on_oom_lock) {
+		ret = mutex_lock_killable(&oom_lock);
+		if (ret)
+			return ret;
+	} else if (!mutex_trylock(&oom_lock))
+		return -EBUSY;
+
+	ret = out_of_memory(&oc);
+
+	mutex_unlock(&oom_lock);
+	return ret;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_oom_kfuncs)
 BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_oom_kfuncs)
 
 static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
-- 
2.50.1




* [PATCH v1 07/14] mm: allow specifying custom oom constraint for bpf triggers
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (5 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 06/14] mm: introduce bpf_out_of_memory() " Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-18 17:01 ` [PATCH v1 08/14] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Currently there is a hard-coded list of possible oom constraints:
NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF.
Also, add the ability to specify a custom constraint name
when calling bpf_out_of_memory(). If NULL is passed
as an argument, CONSTRAINT_BPF is displayed.

The resulting output in dmesg will look like this:

[  315.224875] kworker/u17:0 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
               oom_policy=default
[  315.226532] CPU: 1 UID: 0 PID: 74 Comm: kworker/u17:0 Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[  315.226534] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[  315.226536] Workqueue: bpf_psi_wq bpf_psi_handle_event_fn
[  315.226542] Call Trace:
[  315.226545]  <TASK>
[  315.226548]  dump_stack_lvl+0x4d/0x70
[  315.226555]  dump_header+0x59/0x1c6
[  315.226561]  oom_kill_process.cold+0x8/0xef
[  315.226565]  out_of_memory+0x111/0x5c0
[  315.226577]  bpf_out_of_memory+0x6f/0xd0
[  315.226580]  ? srso_alias_return_thunk+0x5/0xfbef5
[  315.226589]  bpf_prog_3018b0cf55d2c6bb_handle_psi_event+0x5d/0x76
[  315.226594]  bpf__bpf_psi_ops_handle_psi_event+0x47/0xa7
[  315.226599]  bpf_psi_handle_event_fn+0x63/0xb0
[  315.226604]  process_one_work+0x1fc/0x580
[  315.226616]  ? srso_alias_return_thunk+0x5/0xfbef5
[  315.226624]  worker_thread+0x1d9/0x3b0
[  315.226629]  ? __pfx_worker_thread+0x10/0x10
[  315.226632]  kthread+0x128/0x270
[  315.226637]  ? lock_release+0xd4/0x2d0
[  315.226645]  ? __pfx_kthread+0x10/0x10
[  315.226649]  ret_from_fork+0x81/0xd0
[  315.226652]  ? __pfx_kthread+0x10/0x10
[  315.226655]  ret_from_fork_asm+0x1a/0x30
[  315.226667]  </TASK>
[  315.239745] memory: usage 42240kB, limit 9007199254740988kB, failcnt 0
[  315.240231] swap: usage 0kB, limit 0kB, failcnt 0
[  315.240585] Memory cgroup stats for /cgroup-test-work-dir673/oom_test/cg2:
[  315.240603] anon 42897408
[  315.241317] file 0
[  315.241493] kernel 98304
...
[  315.255946] Tasks state (memory values in pages):
[  315.256292] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[  315.257107] [    675]     0   675   162013    10969    10712      257         0   155648        0             0 test_progs
[  315.257927] oom-kill:constraint=CONSTRAINT_BPF_PSI_MEM,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/cgroup-test-work-dir673/oom_test/cg2,task_memcg=/cgroup-test-work-dir673/oom_test/cg2,task=test_progs,pid=675,uid=0
[  315.259371] Memory cgroup out of memory: Killed process 675 (test_progs) total-vm:648052kB, anon-rss:42848kB, file-rss:1028kB, shmem-rss:0kB, UID:0 pgtables:152kB oom_score_adj:0
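
The constraint name in the report above comes from the caller, e.g.
(a sketch of a fragment; "CONSTRAINT_BPF_PSI_MEM" is simply the string
used by the test):

  	ret = bpf_out_of_memory(memcg, 0, true, "CONSTRAINT_BPF_PSI_MEM");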

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/oom.h |  4 ++++
 mm/oom_kill.c       | 38 +++++++++++++++++++++++++++++---------
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index ef453309b7ea..4b04944b42de 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -19,6 +19,7 @@ enum oom_constraint {
 	CONSTRAINT_CPUSET,
 	CONSTRAINT_MEMORY_POLICY,
 	CONSTRAINT_MEMCG,
+	CONSTRAINT_BPF,
 };
 
 /*
@@ -58,6 +59,9 @@ struct oom_control {
 
 	/* Policy name */
 	const char *bpf_policy_name;
+
+	/* BPF-specific constraint name */
+	const char *bpf_constraint;
 #endif
 };
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index df409f0fac45..67afcd43a5f7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -240,13 +240,6 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
 	return points;
 }
 
-static const char * const oom_constraint_text[] = {
-	[CONSTRAINT_NONE] = "CONSTRAINT_NONE",
-	[CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET",
-	[CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY",
-	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
-};
-
 static const char *oom_policy_name(struct oom_control *oc)
 {
 #ifdef CONFIG_BPF_SYSCALL
@@ -256,6 +249,27 @@ static const char *oom_policy_name(struct oom_control *oc)
 	return "default";
 }
 
+static const char *oom_constraint_text(struct oom_control *oc)
+{
+	switch (oc->constraint) {
+	case CONSTRAINT_NONE:
+		return "CONSTRAINT_NONE";
+	case CONSTRAINT_CPUSET:
+		return "CONSTRAINT_CPUSET";
+	case CONSTRAINT_MEMORY_POLICY:
+		return "CONSTRAINT_MEMORY_POLICY";
+	case CONSTRAINT_MEMCG:
+		return "CONSTRAINT_MEMCG";
+#ifdef CONFIG_BPF_SYSCALL
+	case CONSTRAINT_BPF:
+		return oc->bpf_constraint ? : "CONSTRAINT_BPF";
+#endif
+	default:
+		WARN_ON_ONCE(1);
+		return "";
+	}
+}
+
 /*
  * Determine the type of allocation constraint.
  */
@@ -267,6 +281,9 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	bool cpuset_limited = false;
 	int nid;
 
+	if (oc->constraint == CONSTRAINT_BPF)
+		return CONSTRAINT_BPF;
+
 	if (is_memcg_oom(oc)) {
 		oc->totalpages = mem_cgroup_get_max(oc->memcg) ?: 1;
 		return CONSTRAINT_MEMCG;
@@ -458,7 +475,7 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
 {
 	/* one line summary of the oom killer context. */
 	pr_info("oom-kill:constraint=%s,nodemask=%*pbl",
-			oom_constraint_text[oc->constraint],
+			oom_constraint_text(oc),
 			nodemask_pr_args(oc->nodemask));
 	cpuset_print_current_mems_allowed();
 	mem_cgroup_print_oom_context(oc->memcg, victim);
@@ -1344,11 +1361,14 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
  * Returns a negative value if an error occurred.
  */
 __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
-				  int order, bool wait_on_oom_lock)
+				  int order, bool wait_on_oom_lock,
+				  const char *constraint_text__nullable)
 {
 	struct oom_control oc = {
 		.memcg = memcg__nullable,
 		.order = order,
+		.constraint = CONSTRAINT_BPF,
+		.bpf_constraint = constraint_text__nullable,
 	};
 	int ret;
 
-- 
2.50.1




* [PATCH v1 08/14] mm: introduce bpf_task_is_oom_victim() kfunc
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (6 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 07/14] mm: allow specifying custom oom constraint for bpf triggers Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-18 17:01 ` [PATCH v1 09/14] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Export tsk_is_oom_victim() helper as a bpf kfunc.
It's very useful to avoid redundant oom kills.
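
For example, a policy can skip tasks which are already dying (as the
selftest in this series does):

  	if (!bpf_task_is_oom_victim(task))
  		bpf_oom_kill_process(oc, task, "bpf oom policy");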

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 67afcd43a5f7..fe6e69dfbdba 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1388,11 +1388,25 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
 	return ret;
 }
 
+/**
+ * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
+ * @task: task to check
+ *
+ * Returns true if the task has been previously selected by the OOM killer
+ * to be killed. It's expected that the task will be destroyed soon and some
+ * memory will be freed, so maybe no additional actions required.
+ */
+__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
+{
+	return tsk_is_oom_victim(task);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_oom_kfuncs)
 BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_is_oom_victim, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_oom_kfuncs)
 
 static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
-- 
2.50.1




* [PATCH v1 09/14] bpf: selftests: introduce read_cgroup_file() helper
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (7 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 08/14] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-18 17:01 ` [PATCH v1 10/14] bpf: selftests: bpf OOM handler test Roman Gushchin
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Implement read_cgroup_file() helper to read from cgroup control files,
e.g. statistics.
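
Usage mirrors write_cgroup_file(), e.g. (a sketch; the cgroup path and
file name below are just examples):

  	char buf[4096] = {};
  	size_t size;

  	size = read_cgroup_file("/oom_test/cg2", "memory.events", buf, sizeof(buf));
  	if (size > 0 && strstr(buf, "oom_kill 1"))
  		printf("an OOM kill has been accounted in this cgroup\n");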

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 39 ++++++++++++++++++++
 tools/testing/selftests/bpf/cgroup_helpers.h |  2 +
 2 files changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index e4535451322e..3ffd4b764f91 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -125,6 +125,45 @@ int enable_controllers(const char *relative_path, const char *controllers)
 	return __enable_controllers(cgroup_path, controllers);
 }
 
+static size_t __read_cgroup_file(const char *cgroup_path, const char *file,
+				 char *buf, size_t size)
+{
+	char file_path[PATH_MAX + 1];
+	size_t ret;
+	int fd;
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	fd = open(file_path, O_RDONLY);
+	if (fd < 0) {
+		log_err("Opening %s", file_path);
+		return -1;
+	}
+
+	ret = read(fd, buf, size);
+	close(fd);
+	return ret;
+}
+
+/**
+ * read_cgroup_file() - Read from a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @file: The name of the file in cgroupfs to read from
+ * @buf: Buffer to read the file contents into
+ * @size: Size of the buffer
+ *
+ * Read from a file in the given cgroup's directory.
+ *
+ * If successful, the number of bytes read is returned.
+ */
+size_t read_cgroup_file(const char *relative_path, const char *file,
+			char *buf, size_t size)
+{
+	char cgroup_path[PATH_MAX - 24];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __read_cgroup_file(cgroup_path, file, buf, size);
+}
+
 static int __write_cgroup_file(const char *cgroup_path, const char *file,
 			       const char *buf)
 {
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index 502845160d88..821cb76db1f7 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -11,6 +11,8 @@
 
 /* cgroupv2 related */
 int enable_controllers(const char *relative_path, const char *controllers);
+size_t read_cgroup_file(const char *relative_path, const char *file,
+			char *buf, size_t size);
 int write_cgroup_file(const char *relative_path, const char *file,
 		      const char *buf);
 int write_cgroup_file_parent(const char *relative_path, const char *file,
-- 
2.50.1




* [PATCH v1 10/14] bpf: selftests: bpf OOM handler test
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (8 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 09/14] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-20  9:33   ` Kumar Kartikeya Dwivedi
  2025-08-20 20:23   ` Andrii Nakryiko
  2025-08-18 17:01 ` [PATCH v1 11/14] sched: psi: refactor psi_trigger_create() Roman Gushchin
                   ` (5 subsequent siblings)
  15 siblings, 2 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Implement a pseudo-realistic test for the OOM handling
functionality.

The OOM handling policy implemented in bpf is to kill all
tasks belonging to the biggest leaf cgroup which doesn't contain
unkillable tasks (tasks with oom_score_adj set to -1000).
Pagecache size is excluded from the accounting.

The test creates a hierarchy of memory cgroups, causes an
OOM at the top level, checks that the expected process is
killed and checks the memcg's oom statistics.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 .../selftests/bpf/prog_tests/test_oom.c       | 229 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c  | 108 +++++++++
 2 files changed, 337 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom.c b/tools/testing/selftests/bpf/prog_tests/test_oom.c
new file mode 100644
index 000000000000..eaeb14a9d18f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_oom.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "test_oom.skel.h"
+
+struct cgroup_desc {
+	const char *path;
+	int fd;
+	unsigned long long id;
+	int pid;
+	size_t target;
+	size_t max;
+	int oom_score_adj;
+	bool victim;
+};
+
+#define MB (1024 * 1024)
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
+static struct cgroup_desc cgroups[] = {
+	{ .path = "/oom_test", .max = 80 * MB},
+	{ .path = "/oom_test/cg1", .target = 10 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MAX },
+	{ .path = "/oom_test/cg2", .target = 40 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MIN },
+	{ .path = "/oom_test/cg3" },
+	{ .path = "/oom_test/cg3/cg4", .target = 30 * MB,
+	  .victim = true },
+	{ .path = "/oom_test/cg3/cg5", .target = 20 * MB },
+};
+
+static int spawn_task(struct cgroup_desc *desc)
+{
+	char *ptr;
+	int pid;
+
+	pid = fork();
+	if (pid < 0)
+		return pid;
+
+	if (pid > 0) {
+		/* parent */
+		desc->pid = pid;
+		return 0;
+	}
+
+	/* child */
+	if (desc->oom_score_adj) {
+		char buf[64];
+		int fd = open("/proc/self/oom_score_adj", O_WRONLY);
+
+		if (fd < 0)
+			return -1;
+
+		snprintf(buf, sizeof(buf), "%d", desc->oom_score_adj);
+		write(fd, buf, sizeof(buf));
+		close(fd);
+	}
+
+	ptr = (char *)malloc(desc->target);
+	if (!ptr)
+		return -ENOMEM;
+
+	memset(ptr, 'a', desc->target);
+
+	while (1)
+		sleep(1000);
+
+	return 0;
+}
+
+static void setup_environment(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+		cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
+			goto cleanup;
+
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
+			goto cleanup;
+
+		/* Freeze the top-level cgroup */
+		if (i == 0) {
+			/* Freeze the top-level cgroup */
+			err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
+			if (!ASSERT_OK(err, "freeze cgroup"))
+				goto cleanup;
+		}
+
+		/* Recursively enable the memory controller */
+		if (!cgroups[i].target) {
+
+			err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
+						"+memory");
+			if (!ASSERT_OK(err, "enable memory controller"))
+				goto cleanup;
+		}
+
+		/* Set memory.max */
+		if (cgroups[i].max) {
+			char buf[256];
+
+			snprintf(buf, sizeof(buf), "%lu", cgroups[i].max);
+			err = write_cgroup_file(cgroups[i].path, "memory.max", buf);
+			if (!ASSERT_OK(err, "set memory.max"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "0");
+			write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
+
+		}
+
+		/* Spawn tasks creating memory pressure */
+		if (cgroups[i].target) {
+			char buf[256];
+
+			err = spawn_task(&cgroups[i]);
+			if (!ASSERT_OK(err, "spawn task"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
+			err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
+			if (!ASSERT_OK(err, "put child into a cgroup"))
+				goto cleanup;
+		}
+	}
+
+	return;
+
+cleanup:
+	cleanup_cgroup_environment();
+}
+
+static int run_and_wait_for_oom(void)
+{
+	int ret = -1;
+	bool first = true;
+	char buf[4096] = {};
+	size_t size;
+
+	/* Unfreeze the top-level cgroup */
+	ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	if (!ASSERT_OK(ret, "freeze cgroup"))
+		return -1;
+
+	for (;;) {
+		int i, status;
+		pid_t pid = wait(&status);
+
+		if (pid == -1) {
+			if (errno == EINTR)
+				continue;
+			/* ECHILD */
+			break;
+		}
+
+		if (!first)
+			continue;
+
+		first = false;
+
+		/* Check which process was terminated first */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+			if (!ASSERT_OK(cgroups[i].victim !=
+				       (pid == cgroups[i].pid),
+				       "correct process was killed")) {
+				ret = -1;
+				break;
+			}
+
+			if (!cgroups[i].victim)
+				continue;
+
+			/* Check the memcg oom counter */
+			size = read_cgroup_file(cgroups[i].path,
+						"memory.events",
+						buf, sizeof(buf));
+			if (!ASSERT_OK(size <= 0, "read memory.events")) {
+				ret = -1;
+				break;
+			}
+
+			if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
+				       "oom_kill count check")) {
+				ret = -1;
+				break;
+			}
+		}
+
+		/* Kill all remaining tasks */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++)
+			if (cgroups[i].pid && cgroups[i].pid != pid)
+				kill(cgroups[i].pid, SIGKILL);
+	}
+
+	return ret;
+}
+
+void test_oom(void)
+{
+	struct test_oom *skel;
+	int err;
+
+	setup_environment();
+
+	skel = test_oom__open_and_load();
+	if (CHECK_FAIL(!skel))
+		goto cleanup;
+
+	err = test_oom__attach(skel);
+	if (CHECK_FAIL(err))
+		goto cleanup;
+
+	/* Unfreeze all child tasks and create the memory pressure */
+	err = run_and_wait_for_oom();
+	CHECK_FAIL(err);
+
+cleanup:
+	cleanup_cgroup_environment();
+	test_oom__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_oom.c b/tools/testing/selftests/bpf/progs/test_oom.c
new file mode 100644
index 000000000000..ca83563fc9a8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_oom.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define OOM_SCORE_ADJ_MIN	(-1000)
+
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym;
+struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
+void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;
+int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *task,
+			 const char *message__str) __ksym;
+
+static bool mem_cgroup_killable(struct mem_cgroup *memcg)
+{
+	struct task_struct *task;
+	bool ret = true;
+
+	bpf_for_each(css_task, task, &memcg->css, CSS_TASK_ITER_PROCS)
+		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			return false;
+
+	return ret;
+}
+
+/*
+ * Find the leaf memory cgroup with the largest memory usage (ignoring
+ * page cache) that has no unkillable tasks and kill all of its tasks.
+ */
+SEC("struct_ops.s/handle_out_of_memory")
+int BPF_PROG(test_out_of_memory, struct oom_control *oc)
+{
+	struct task_struct *task;
+	struct mem_cgroup *root_memcg = oc->memcg;
+	struct mem_cgroup *memcg, *victim = NULL;
+	struct cgroup_subsys_state *css_pos;
+	unsigned long usage, max_usage = 0;
+	unsigned long pagecache = 0;
+	int ret = 0;
+
+	if (root_memcg)
+		root_memcg = bpf_get_mem_cgroup(&root_memcg->css);
+	else
+		root_memcg = bpf_get_root_mem_cgroup();
+
+	if (!root_memcg)
+		return 0;
+
+	bpf_rcu_read_lock();
+	bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
+		if (css_pos->cgroup->nr_descendants + css_pos->cgroup->nr_dying_descendants)
+			continue;
+
+		memcg = bpf_get_mem_cgroup(css_pos);
+		if (!memcg)
+			continue;
+
+		usage = bpf_mem_cgroup_usage(memcg);
+		pagecache = bpf_mem_cgroup_page_state(memcg, NR_FILE_PAGES);
+
+		if (usage > pagecache)
+			usage -= pagecache;
+		else
+			usage = 0;
+
+		if ((usage > max_usage) && mem_cgroup_killable(memcg)) {
+			max_usage = usage;
+			if (victim)
+				bpf_put_mem_cgroup(victim);
+			victim = bpf_get_mem_cgroup(&memcg->css);
+		}
+
+		bpf_put_mem_cgroup(memcg);
+	}
+	bpf_rcu_read_unlock();
+
+	if (!victim)
+		goto exit;
+
+	bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
+		struct task_struct *t = bpf_task_acquire(task);
+
+		if (t) {
+			if (!bpf_task_is_oom_victim(task))
+				bpf_oom_kill_process(oc, task, "bpf oom test");
+			bpf_task_release(t);
+			ret = 1;
+		}
+	}
+
+	bpf_put_mem_cgroup(victim);
+exit:
+	bpf_put_mem_cgroup(root_memcg);
+
+	return ret;
+}
+
+SEC(".struct_ops.link")
+struct bpf_oom_ops test_bpf_oom = {
+	.name = "bpf_test_policy",
+	.handle_out_of_memory = (void *)test_out_of_memory,
+};
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v1 11/14] sched: psi: refactor psi_trigger_create()
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (9 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 10/14] bpf: selftests: bpf OOM handler test Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-19  4:09   ` Suren Baghdasaryan
  2025-08-18 17:01 ` [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf Roman Gushchin
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Currently psi_trigger_create() does a lot of things:
it parses the user text input, allocates and initializes
the psi_trigger structure and turns on the trigger.
It does this slightly differently for the two existing types
of psi triggers: system-wide and cgroup-wide.

In order to support a new type of psi trigger, which
will be owned by a bpf program and won't have a user-provided
text description, let's refactor psi_trigger_create(); the
resulting call pattern is sketched after the list below.

1. Introduce psi_trigger_type enum:
   currently PSI_SYSTEM and PSI_CGROUP are valid values.
2. Introduce psi_trigger_params structure to avoid passing
   a large number of parameters to psi_trigger_create().
3. Move out the user's input parsing into the new
   psi_trigger_parse() helper.
4. Move out the capabilities check into the new
   psi_file_privileged() helper.
5. Stop relying on t->of for detecting trigger type.

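For illustration only (not part of the patch itself), here is roughly
the call pattern a cgroup-side caller ends up with after this
refactoring, modeled on the pressure_write() hunk below; error
handling and locking are trimmed, so treat it as a sketch:

	struct psi_trigger_params params;
	struct psi_trigger *new;
	int err;

	/* Parse the user-provided string, e.g. "some 150000 1000000" */
	err = psi_trigger_parse(&params, buf);
	if (err)
		return err;

	/* Fill in the fields not coming from the parsed string */
	params.type = PSI_CGROUP;
	params.res = res;
	params.privileged = psi_file_privileged(of->file);
	params.of = of;

	new = psi_trigger_create(psi, &params);
	if (IS_ERR(new))
		return PTR_ERR(new);
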
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/psi.h       | 15 +++++--
 include/linux/psi_types.h | 33 ++++++++++++++-
 kernel/cgroup/cgroup.c    | 14 ++++++-
 kernel/sched/psi.c        | 87 +++++++++++++++++++++++++--------------
 4 files changed, 112 insertions(+), 37 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index e0745873e3f2..8178e998d94b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -23,14 +23,23 @@ void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
-struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
-				       enum psi_res res, struct file *file,
-				       struct kernfs_open_file *of);
+int psi_trigger_parse(struct psi_trigger_params *params, const char *buf);
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+				const struct psi_trigger_params *param);
 void psi_trigger_destroy(struct psi_trigger *t);
 
 __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 			poll_table *wait);
 
+static inline bool psi_file_privileged(struct file *file)
+{
+	/*
+	 * Checking the privilege here on file->f_cred implies that a privileged user
+	 * could open the file and delegate the write to an unprivileged one.
+	 */
+	return cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
+}
+
 #ifdef CONFIG_CGROUPS
 static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
 {
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index f1fd3a8044e0..cea54121d9b9 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -121,7 +121,38 @@ struct psi_window {
 	u64 prev_growth;
 };
 
+enum psi_trigger_type {
+	PSI_SYSTEM,
+	PSI_CGROUP,
+};
+
+struct psi_trigger_params {
+	/* Trigger type */
+	enum psi_trigger_type type;
+
+	/* Resources that workloads could be stalled on */
+	enum psi_res res;
+
+	/* True if all threads should be stalled to trigger */
+	bool full;
+
+	/* Threshold in us */
+	u32 threshold_us;
+
+	/* Window in us */
+	u32 window_us;
+
+	/* Privileged triggers are treated differently */
+	bool privileged;
+
+	/* Link to kernfs open file, only for PSI_CGROUP */
+	struct kernfs_open_file *of;
+};
+
 struct psi_trigger {
+	/* Trigger type */
+	enum psi_trigger_type type;
+
 	/* PSI state being monitored by the trigger */
 	enum psi_states state;
 
@@ -137,7 +168,7 @@ struct psi_trigger {
 	/* Wait queue for polling */
 	wait_queue_head_t event_wait;
 
-	/* Kernfs file for cgroup triggers */
+	/* Kernfs file for PSI_CGROUP triggers */
 	struct kernfs_open_file *of;
 
 	/* Pending event flag */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a723b7dc6e4e..9cd3c3a52c21 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3872,6 +3872,12 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
 	struct psi_trigger *new;
 	struct cgroup *cgrp;
 	struct psi_group *psi;
+	struct psi_trigger_params params;
+	int err;
+
+	err = psi_trigger_parse(&params, buf);
+	if (err)
+		return err;
 
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
@@ -3887,7 +3893,13 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
 	}
 
 	psi = cgroup_psi(cgrp);
-	new = psi_trigger_create(psi, buf, res, of->file, of);
+
+	params.type = PSI_CGROUP;
+	params.res = res;
+	params.privileged = psi_file_privileged(of->file);
+	params.of = of;
+
+	new = psi_trigger_create(psi, &params);
 	if (IS_ERR(new)) {
 		cgroup_put(cgrp);
 		return PTR_ERR(new);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ad04a5c3162a..e1d8eaeeff17 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -489,7 +489,7 @@ static void update_triggers(struct psi_group *group, u64 now,
 
 		/* Generate an event */
 		if (cmpxchg(&t->event, 0, 1) == 0) {
-			if (t->of)
+			if (t->type == PSI_CGROUP)
 				kernfs_notify(t->of->kn);
 			else
 				wake_up_interruptible(&t->event_wait);
@@ -1281,74 +1281,87 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 	return 0;
 }
 
-struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
-				       enum psi_res res, struct file *file,
-				       struct kernfs_open_file *of)
+int psi_trigger_parse(struct psi_trigger_params *params, const char *buf)
 {
-	struct psi_trigger *t;
-	enum psi_states state;
-	u32 threshold_us;
-	bool privileged;
-	u32 window_us;
+	u32 threshold_us, window_us;
 
 	if (static_branch_likely(&psi_disabled))
-		return ERR_PTR(-EOPNOTSUPP);
-
-	/*
-	 * Checking the privilege here on file->f_cred implies that a privileged user
-	 * could open the file and delegate the write to an unprivileged one.
-	 */
-	privileged = cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
+		return -EOPNOTSUPP;
 
 	if (sscanf(buf, "some %u %u", &threshold_us, &window_us) == 2)
-		state = PSI_IO_SOME + res * 2;
+		params->full = false;
 	else if (sscanf(buf, "full %u %u", &threshold_us, &window_us) == 2)
-		state = PSI_IO_FULL + res * 2;
+		params->full = true;
 	else
-		return ERR_PTR(-EINVAL);
+		return -EINVAL;
+
+	params->threshold_us = threshold_us;
+	params->window_us = window_us;
+	return 0;
+}
+
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+				       const struct psi_trigger_params *params)
+{
+	struct psi_trigger *t;
+	enum psi_states state;
+
+	if (static_branch_likely(&psi_disabled))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	state = params->full ? PSI_IO_FULL : PSI_IO_SOME;
+	state += params->res * 2;
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-	if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+	if (params->res == PSI_IRQ && --state != PSI_IRQ_FULL)
 		return ERR_PTR(-EINVAL);
 #endif
 
 	if (state >= PSI_NONIDLE)
 		return ERR_PTR(-EINVAL);
 
-	if (window_us == 0 || window_us > WINDOW_MAX_US)
+	if (params->window_us == 0 || params->window_us > WINDOW_MAX_US)
 		return ERR_PTR(-EINVAL);
 
 	/*
 	 * Unprivileged users can only use 2s windows so that averages aggregation
 	 * work is used, and no RT threads need to be spawned.
 	 */
-	if (!privileged && window_us % 2000000)
+	if (!params->privileged && params->window_us % 2000000)
 		return ERR_PTR(-EINVAL);
 
 	/* Check threshold */
-	if (threshold_us == 0 || threshold_us > window_us)
+	if (params->threshold_us == 0 || params->threshold_us > params->window_us)
 		return ERR_PTR(-EINVAL);
 
 	t = kmalloc(sizeof(*t), GFP_KERNEL);
 	if (!t)
 		return ERR_PTR(-ENOMEM);
 
+	t->type = params->type;
 	t->group = group;
 	t->state = state;
-	t->threshold = threshold_us * NSEC_PER_USEC;
-	t->win.size = window_us * NSEC_PER_USEC;
+	t->threshold = params->threshold_us * NSEC_PER_USEC;
+	t->win.size = params->window_us * NSEC_PER_USEC;
 	window_reset(&t->win, sched_clock(),
 			group->total[PSI_POLL][t->state], 0);
 
 	t->event = 0;
 	t->last_event_time = 0;
-	t->of = of;
-	if (!of)
+
+	switch (params->type) {
+	case PSI_SYSTEM:
 		init_waitqueue_head(&t->event_wait);
+		break;
+	case PSI_CGROUP:
+		t->of = params->of;
+		break;
+	}
+
 	t->pending_event = false;
-	t->aggregator = privileged ? PSI_POLL : PSI_AVGS;
+	t->aggregator = params->privileged ? PSI_POLL : PSI_AVGS;
 
-	if (privileged) {
+	if (params->privileged) {
 		mutex_lock(&group->rtpoll_trigger_lock);
 
 		if (!rcu_access_pointer(group->rtpoll_task)) {
@@ -1401,7 +1414,7 @@ void psi_trigger_destroy(struct psi_trigger *t)
 	 * being accessed later. Can happen if cgroup is deleted from under a
 	 * polling process.
 	 */
-	if (t->of)
+	if (t->type == PSI_CGROUP)
 		kernfs_notify(t->of->kn);
 	else
 		wake_up_interruptible(&t->event_wait);
@@ -1481,7 +1494,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
 	if (!t)
 		return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
 
-	if (t->of)
+	if (t->type == PSI_CGROUP)
 		kernfs_generic_poll(t->of, wait);
 	else
 		poll_wait(file, &t->event_wait, wait);
@@ -1530,6 +1543,8 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 	size_t buf_size;
 	struct seq_file *seq;
 	struct psi_trigger *new;
+	struct psi_trigger_params params;
+	int err;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
@@ -1543,6 +1558,10 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 
 	buf[buf_size - 1] = '\0';
 
+	err = psi_trigger_parse(&params, buf);
+	if (err)
+		return err;
+
 	seq = file->private_data;
 
 	/* Take seq->lock to protect seq->private from concurrent writes */
@@ -1554,7 +1573,11 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 		return -EBUSY;
 	}
 
-	new = psi_trigger_create(&psi_system, buf, res, file, NULL);
+	params.type = PSI_SYSTEM;
+	params.res = res;
+	params.privileged = psi_file_privileged(file);
+
+	new = psi_trigger_create(&psi_system, &params);
 	if (IS_ERR(new)) {
 		mutex_unlock(&seq->lock);
 		return PTR_ERR(new);
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (10 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 11/14] sched: psi: refactor psi_trigger_create() Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-19  4:11   ` Suren Baghdasaryan
  2025-08-26 17:03   ` Amery Hung
  2025-08-18 17:01 ` [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc Roman Gushchin
                   ` (3 subsequent siblings)
  15 siblings, 2 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

This patch implements a bpf struct ops-based mechanism to create
psi triggers, attach them to cgroups or system-wide, and handle
psi events in bpf.

The struct ops provides 3 callbacks:
  - init() called once at load, handy for creating psi triggers
  - handle_psi_event() called every time a psi trigger fires
  - handle_cgroup_free() called if a cgroup with an attached
    trigger is being freed

A single struct ops can create a number of psi triggers, both
cgroup-scoped and system-wide.

All 3 struct ops callbacks can be sleepable. handle_psi_event()
handlers are executed using a separate workqueue, so they won't
affect the latency of other psi triggers.
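
For illustration only (this is not part of the patch and the names are
made up), a minimal handler built on top of this struct ops could look
roughly like the sketch below; the kfunc for creating triggers from
init() is only introduced in a later patch:

	SEC("struct_ops.s/init")
	int BPF_PROG(psi_init, struct bpf_psi *bpf_psi)
	{
		/* create psi triggers here, see the following patches */
		return 0;
	}

	SEC("struct_ops.s/handle_psi_event")
	void BPF_PROG(handle_psi_event, struct psi_trigger *t)
	{
		/* react to the pressure event, e.g. declare an OOM */
	}

	SEC("struct_ops.s/handle_cgroup_free")
	void BPF_PROG(handle_cgroup_free, u64 cgroup_id)
	{
		/* a cgroup with an attached trigger has been freed */
	}

	SEC(".struct_ops.link")
	struct bpf_psi_ops psi_ops_example = {
		.init = (void *)psi_init,
		.handle_psi_event = (void *)handle_psi_event,
		.handle_cgroup_free = (void *)handle_cgroup_free,
	};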

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/bpf_psi.h      |  71 ++++++++++
 include/linux/psi_types.h    |  43 +++++-
 kernel/sched/bpf_psi.c       | 253 +++++++++++++++++++++++++++++++++++
 kernel/sched/build_utility.c |   4 +
 kernel/sched/psi.c           |  49 +++++--
 5 files changed, 408 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/bpf_psi.h
 create mode 100644 kernel/sched/bpf_psi.c

diff --git a/include/linux/bpf_psi.h b/include/linux/bpf_psi.h
new file mode 100644
index 000000000000..826ab89ac11c
--- /dev/null
+++ b/include/linux/bpf_psi.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_PSI_H
+#define __BPF_PSI_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/srcu.h>
+#include <linux/psi_types.h>
+
+struct cgroup;
+struct bpf_psi;
+struct psi_trigger;
+struct psi_trigger_params;
+
+#define BPF_PSI_FULL 0x80000000
+
+struct bpf_psi_ops {
+	/**
+	 * @init: Initialization callback, suited for creating psi triggers.
+	 * @bpf_psi: bpf_psi pointer, can be passed to bpf_psi_create_trigger().
+	 *
+	 * A non-zero return value means the initialization has failed.
+	 */
+	int (*init)(struct bpf_psi *bpf_psi);
+
+	/**
+	 * @handle_psi_event: PSI event callback
+	 * @t: psi_trigger pointer
+	 */
+	void (*handle_psi_event)(struct psi_trigger *t);
+
+	/**
+	 * @handle_cgroup_free: Cgroup free callback
+	 * @cgroup_id: Id of freed cgroup
+	 *
+	 * Called every time a cgroup with an attached bpf psi trigger is freed.
+	 * No psi events can be raised after handle_cgroup_free().
+	 */
+	void (*handle_cgroup_free)(u64 cgroup_id);
+
+	/* private */
+	struct bpf_psi *bpf_psi;
+};
+
+struct bpf_psi {
+	spinlock_t lock;
+	struct list_head triggers;
+	struct bpf_psi_ops *ops;
+	struct srcu_struct srcu;
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+void bpf_psi_add_trigger(struct psi_trigger *t,
+			 const struct psi_trigger_params *params);
+void bpf_psi_remove_trigger(struct psi_trigger *t);
+void bpf_psi_handle_event(struct psi_trigger *t);
+#ifdef CONFIG_CGROUPS
+void bpf_psi_cgroup_free(struct cgroup *cgroup);
+#endif
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline void bpf_psi_add_trigger(struct psi_trigger *t,
+			const struct psi_trigger_params *params) {}
+static inline void bpf_psi_remove_trigger(struct psi_trigger *t) {}
+static inline void bpf_psi_handle_event(struct psi_trigger *t) {}
+static inline void bpf_psi_cgroup_free(struct cgroup *cgroup) {}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index cea54121d9b9..f695cc34cfd4 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -124,6 +124,7 @@ struct psi_window {
 enum psi_trigger_type {
 	PSI_SYSTEM,
 	PSI_CGROUP,
+	PSI_BPF,
 };
 
 struct psi_trigger_params {
@@ -145,8 +146,15 @@ struct psi_trigger_params {
 	/* Privileged triggers are treated differently */
 	bool privileged;
 
-	/* Link to kernfs open file, only for PSI_CGROUP */
-	struct kernfs_open_file *of;
+	union {
+		/* Link to kernfs open file, only for PSI_CGROUP */
+		struct kernfs_open_file *of;
+
+#ifdef CONFIG_BPF_SYSCALL
+		/* Link to bpf_psi structure, only for BPF_PSI */
+		struct bpf_psi *bpf_psi;
+#endif
+	};
 };
 
 struct psi_trigger {
@@ -188,6 +196,31 @@ struct psi_trigger {
 
 	/* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
 	enum psi_aggregators aggregator;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Fields specific to PSI_BPF triggers */
+
+	/* Bpf psi structure for events handling */
+	struct bpf_psi *bpf_psi;
+
+	/* List node inside bpf_psi->triggers list */
+	struct list_head bpf_psi_node;
+
+	/* List node inside group->bpf_triggers list */
+	struct list_head bpf_group_node;
+
+	/* Work structure, used to execute event handlers */
+	struct work_struct bpf_work;
+
+	/*
+	 * Whether the trigger is being pinned in memory.
+	 * Protected by group->bpf_triggers_lock.
+	 */
+	bool pinned;
+
+	/* Cgroup Id */
+	u64 cgroup_id;
+#endif
 };
 
 struct psi_group {
@@ -236,6 +269,12 @@ struct psi_group {
 	u64 rtpoll_total[NR_PSI_STATES - 1];
 	u64 rtpoll_next_update;
 	u64 rtpoll_until;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* List of triggers owned by bpf and corresponding lock */
+	spinlock_t bpf_triggers_lock;
+	struct list_head bpf_triggers;
+#endif
 };
 
 #else /* CONFIG_PSI */
diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
new file mode 100644
index 000000000000..2ea9d7276b21
--- /dev/null
+++ b/kernel/sched/bpf_psi.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF PSI event handlers
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/bpf_psi.h>
+#include <linux/cgroup-defs.h>
+
+static struct workqueue_struct *bpf_psi_wq;
+
+static struct bpf_psi *bpf_psi_create(struct bpf_psi_ops *ops)
+{
+	struct bpf_psi *bpf_psi;
+
+	bpf_psi = kzalloc(sizeof(*bpf_psi), GFP_KERNEL);
+	if (!bpf_psi)
+		return NULL;
+
+	if (init_srcu_struct(&bpf_psi->srcu)) {
+		kfree(bpf_psi);
+		return NULL;
+	}
+
+	spin_lock_init(&bpf_psi->lock);
+	bpf_psi->ops = ops;
+	INIT_LIST_HEAD(&bpf_psi->triggers);
+	ops->bpf_psi = bpf_psi;
+
+	return bpf_psi;
+}
+
+static void bpf_psi_free(struct bpf_psi *bpf_psi)
+{
+	cleanup_srcu_struct(&bpf_psi->srcu);
+	kfree(bpf_psi);
+}
+
+static void bpf_psi_handle_event_fn(struct work_struct *work)
+{
+	struct psi_trigger *t;
+	struct bpf_psi *bpf_psi;
+	int idx;
+
+	t = container_of(work, struct psi_trigger, bpf_work);
+	bpf_psi = READ_ONCE(t->bpf_psi);
+
+	if (likely(bpf_psi)) {
+		idx = srcu_read_lock(&bpf_psi->srcu);
+		if (bpf_psi->ops->handle_psi_event)
+			bpf_psi->ops->handle_psi_event(t);
+		srcu_read_unlock(&bpf_psi->srcu, idx);
+	}
+}
+
+void bpf_psi_add_trigger(struct psi_trigger *t,
+			 const struct psi_trigger_params *params)
+{
+	t->bpf_psi = params->bpf_psi;
+	t->pinned = false;
+	INIT_WORK(&t->bpf_work, bpf_psi_handle_event_fn);
+
+	spin_lock(&t->bpf_psi->lock);
+	list_add(&t->bpf_psi_node, &t->bpf_psi->triggers);
+	spin_unlock(&t->bpf_psi->lock);
+
+	spin_lock(&t->group->bpf_triggers_lock);
+	list_add(&t->bpf_group_node, &t->group->bpf_triggers);
+	spin_unlock(&t->group->bpf_triggers_lock);
+}
+
+void bpf_psi_remove_trigger(struct psi_trigger *t)
+{
+	spin_lock(&t->group->bpf_triggers_lock);
+	list_del(&t->bpf_group_node);
+	spin_unlock(&t->group->bpf_triggers_lock);
+
+	spin_lock(&t->bpf_psi->lock);
+	list_del(&t->bpf_psi_node);
+	spin_unlock(&t->bpf_psi->lock);
+}
+
+#ifdef CONFIG_CGROUPS
+void bpf_psi_cgroup_free(struct cgroup *cgroup)
+{
+	struct psi_group *group = cgroup->psi;
+	u64 cgrp_id = cgroup_id(cgroup);
+	struct psi_trigger *t, *p;
+	struct bpf_psi *bpf_psi;
+	LIST_HEAD(to_destroy);
+	int idx;
+
+	spin_lock(&group->bpf_triggers_lock);
+	list_for_each_entry_safe(t, p, &group->bpf_triggers, bpf_group_node) {
+		if (!t->pinned) {
+			t->pinned = true;
+			list_move(&t->bpf_group_node, &to_destroy);
+		}
+	}
+	spin_unlock(&group->bpf_triggers_lock);
+
+	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node) {
+		bpf_psi = READ_ONCE(t->bpf_psi);
+
+		idx = srcu_read_lock(&bpf_psi->srcu);
+		if (bpf_psi->ops->handle_cgroup_free)
+			bpf_psi->ops->handle_cgroup_free(cgrp_id);
+		srcu_read_unlock(&bpf_psi->srcu, idx);
+
+		spin_lock(&bpf_psi->lock);
+		list_del(&t->bpf_psi_node);
+		spin_unlock(&bpf_psi->lock);
+
+		WRITE_ONCE(t->bpf_psi, NULL);
+		flush_workqueue(bpf_psi_wq);
+		synchronize_srcu(&bpf_psi->srcu);
+		psi_trigger_destroy(t);
+	}
+}
+#endif
+
+void bpf_psi_handle_event(struct psi_trigger *t)
+{
+	queue_work(bpf_psi_wq, &t->bpf_work);
+}
+
+// bpf struct ops
+
+static int __bpf_psi_init(struct bpf_psi *bpf_psi) { return 0; }
+static void __bpf_psi_handle_psi_event(struct psi_trigger *t) {}
+static void __bpf_psi_handle_cgroup_free(u64 cgroup_id) {}
+
+static struct bpf_psi_ops __bpf_psi_ops = {
+	.init = __bpf_psi_init,
+	.handle_psi_event = __bpf_psi_handle_psi_event,
+	.handle_cgroup_free = __bpf_psi_handle_cgroup_free,
+};
+
+static const struct bpf_func_proto *
+bpf_psi_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_psi_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
+	.get_func_proto = bpf_psi_func_proto,
+	.is_valid_access = bpf_psi_ops_is_valid_access,
+};
+
+static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_psi_ops *ops = kdata;
+	struct bpf_psi *bpf_psi;
+
+	bpf_psi = bpf_psi_create(ops);
+	if (!bpf_psi)
+		return -ENOMEM;
+
+	return ops->init(bpf_psi);
+}
+
+static void bpf_psi_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_psi_ops *ops = kdata;
+	struct bpf_psi *bpf_psi = ops->bpf_psi;
+	struct psi_trigger *t, *p;
+	LIST_HEAD(to_destroy);
+
+	spin_lock(&bpf_psi->lock);
+	list_for_each_entry_safe(t, p, &bpf_psi->triggers, bpf_psi_node) {
+		spin_lock(&t->group->bpf_triggers_lock);
+		if (!t->pinned) {
+			t->pinned = true;
+			list_move(&t->bpf_group_node, &to_destroy);
+			list_del(&t->bpf_psi_node);
+
+			WRITE_ONCE(t->bpf_psi, NULL);
+		}
+		spin_unlock(&t->group->bpf_triggers_lock);
+	}
+	spin_unlock(&bpf_psi->lock);
+
+	flush_workqueue(bpf_psi_wq);
+	synchronize_srcu(&bpf_psi->srcu);
+
+	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node)
+		psi_trigger_destroy(t);
+
+	bpf_psi_free(bpf_psi);
+}
+
+static int bpf_psi_ops_check_member(const struct btf_type *t,
+				    const struct btf_member *member,
+				    const struct bpf_prog *prog)
+{
+	return 0;
+}
+
+static int bpf_psi_ops_init_member(const struct btf_type *t,
+				   const struct btf_member *member,
+				   void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_psi_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static struct bpf_struct_ops bpf_psi_bpf_ops = {
+	.verifier_ops = &bpf_psi_verifier_ops,
+	.reg = bpf_psi_ops_reg,
+	.unreg = bpf_psi_ops_unreg,
+	.check_member = bpf_psi_ops_check_member,
+	.init_member = bpf_psi_ops_init_member,
+	.init = bpf_psi_ops_init,
+	.name = "bpf_psi_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &__bpf_psi_ops
+};
+
+static int __init bpf_psi_struct_ops_init(void)
+{
+	int wq_flags = WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_HIGHPRI;
+	int err;
+
+	bpf_psi_wq = alloc_workqueue("bpf_psi_wq", wq_flags, 0);
+	if (!bpf_psi_wq)
+		return -ENOMEM;
+
+	err = register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops);
+	if (err) {
+		pr_warn("error while registering bpf psi struct ops: %d", err);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	destroy_workqueue(bpf_psi_wq);
+	return err;
+}
+late_initcall(bpf_psi_struct_ops_init);
diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c
index bf9d8db94b70..80f3799a2fa6 100644
--- a/kernel/sched/build_utility.c
+++ b/kernel/sched/build_utility.c
@@ -19,6 +19,7 @@
 #include <linux/sched/rseq_api.h>
 #include <linux/sched/task_stack.h>
 
+#include <linux/bpf_psi.h>
 #include <linux/cpufreq.h>
 #include <linux/cpumask_api.h>
 #include <linux/cpuset.h>
@@ -92,6 +93,9 @@
 
 #ifdef CONFIG_PSI
 # include "psi.c"
+# ifdef CONFIG_BPF_SYSCALL
+#  include "bpf_psi.c"
+# endif
 #endif
 
 #ifdef CONFIG_MEMBARRIER
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index e1d8eaeeff17..e10fbbc34099 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -201,6 +201,10 @@ static void group_init(struct psi_group *group)
 	init_waitqueue_head(&group->rtpoll_wait);
 	timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
 	rcu_assign_pointer(group->rtpoll_task, NULL);
+#ifdef CONFIG_BPF_SYSCALL
+	spin_lock_init(&group->bpf_triggers_lock);
+	INIT_LIST_HEAD(&group->bpf_triggers);
+#endif
 }
 
 void __init psi_init(void)
@@ -489,10 +493,17 @@ static void update_triggers(struct psi_group *group, u64 now,
 
 		/* Generate an event */
 		if (cmpxchg(&t->event, 0, 1) == 0) {
-			if (t->type == PSI_CGROUP)
-				kernfs_notify(t->of->kn);
-			else
+			switch (t->type) {
+			case PSI_SYSTEM:
 				wake_up_interruptible(&t->event_wait);
+				break;
+			case PSI_CGROUP:
+				kernfs_notify(t->of->kn);
+				break;
+			case PSI_BPF:
+				bpf_psi_handle_event(t);
+				break;
+			}
 		}
 		t->last_event_time = now;
 		/* Reset threshold breach flag once event got generated */
@@ -1125,6 +1136,7 @@ void psi_cgroup_free(struct cgroup *cgroup)
 		return;
 
 	cancel_delayed_work_sync(&cgroup->psi->avgs_work);
+	bpf_psi_cgroup_free(cgroup);
 	free_percpu(cgroup->psi->pcpu);
 	/* All triggers must be removed by now */
 	WARN_ONCE(cgroup->psi->rtpoll_states, "psi: trigger leak\n");
@@ -1356,6 +1368,9 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 	case PSI_CGROUP:
 		t->of = params->of;
 		break;
+	case PSI_BPF:
+		bpf_psi_add_trigger(t, params);
+		break;
 	}
 
 	t->pending_event = false;
@@ -1369,8 +1384,10 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 
 			task = kthread_create(psi_rtpoll_worker, group, "psimon");
 			if (IS_ERR(task)) {
-				kfree(t);
 				mutex_unlock(&group->rtpoll_trigger_lock);
+				if (t->type == PSI_BPF)
+					bpf_psi_remove_trigger(t);
+				kfree(t);
 				return ERR_CAST(task);
 			}
 			atomic_set(&group->rtpoll_wakeup, 0);
@@ -1414,10 +1431,16 @@ void psi_trigger_destroy(struct psi_trigger *t)
 	 * being accessed later. Can happen if cgroup is deleted from under a
 	 * polling process.
 	 */
-	if (t->type == PSI_CGROUP)
-		kernfs_notify(t->of->kn);
-	else
+	switch (t->type) {
+	case PSI_SYSTEM:
 		wake_up_interruptible(&t->event_wait);
+		break;
+	case PSI_CGROUP:
+		kernfs_notify(t->of->kn);
+		break;
+	case PSI_BPF:
+		break;
+	}
 
 	if (t->aggregator == PSI_AVGS) {
 		mutex_lock(&group->avgs_lock);
@@ -1494,10 +1517,16 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
 	if (!t)
 		return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
 
-	if (t->type == PSI_CGROUP)
-		kernfs_generic_poll(t->of, wait);
-	else
+	switch (t->type) {
+	case PSI_SYSTEM:
 		poll_wait(file, &t->event_wait, wait);
+		break;
+	case PSI_CGROUP:
+		kernfs_generic_poll(t->of, wait);
+		break;
+	case PSI_BPF:
+		break;
+	}
 
 	if (cmpxchg(&t->event, 1, 0) == 1)
 		ret |= EPOLLPRI;
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (11 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-20 20:30   ` Andrii Nakryiko
  2025-08-18 17:01 ` [PATCH v1 14/14] bpf: selftests: psi struct ops test Roman Gushchin
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Implement a new bpf_psi_create_trigger() bpf kfunc, which allows
creating new psi triggers and attaching them either to a cgroup or
system-wide.

Created triggers exist as long as the struct ops is loaded and, if
they are attached to a cgroup, as long as the cgroup exists.

Because kfuncs are limited to 5 arguments, the resource type and the
"full" bit are squeezed into a single u32 argument.
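
As a rough example (mirroring the selftest added later in the series),
an init() callback could set up a cgroup-scoped memory "full" trigger
like this; target_cgroup_id is a made-up global here and the threshold
and window values are arbitrary:

	#define PSI_FULL 0x80000000

	u64 target_cgroup_id;

	SEC("struct_ops.s/init")
	int BPF_PROG(psi_init, struct bpf_psi *bpf_psi)
	{
		/* 100ms of full memory stall within a 1s window */
		return bpf_psi_create_trigger(bpf_psi, target_cgroup_id,
					      PSI_MEM | PSI_FULL,
					      100000, 1000000);
	}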

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 kernel/sched/bpf_psi.c | 84 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 84 insertions(+)

diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
index 2ea9d7276b21..94b684221708 100644
--- a/kernel/sched/bpf_psi.c
+++ b/kernel/sched/bpf_psi.c
@@ -156,6 +156,83 @@ static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
 	.is_valid_access = bpf_psi_ops_is_valid_access,
 };
 
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_psi_create_trigger - Create a PSI trigger
+ * @bpf_psi: bpf_psi struct to attach the trigger to
+ * @cgroup_id: cgroup Id to attach the trigger; 0 for system-wide scope
+ * @resource: resource to monitor (PSI_MEM, PSI_IO, etc) and the full bit.
+ * @threshold_us: threshold in us
+ * @window_us: window in us
+ *
+ * Creates a PSI trigger and attaches it to bpf_psi. The trigger stays
+ * active until the bpf struct ops is unloaded or the corresponding
+ * cgroup is deleted.
+ *
+ * Resource's most significant bit encodes whether "some" or "full"
+ * PSI state should be tracked.
+ *
+ * Returns 0 on success and the error code on failure.
+ */
+__bpf_kfunc int bpf_psi_create_trigger(struct bpf_psi *bpf_psi,
+				       u64 cgroup_id, u32 resource,
+				       u32 threshold_us, u32 window_us)
+{
+	enum psi_res res = resource & ~BPF_PSI_FULL;
+	bool full = resource & BPF_PSI_FULL;
+	struct psi_trigger_params params;
+	struct cgroup *cgroup __maybe_unused = NULL;
+	struct psi_group *group;
+	struct psi_trigger *t;
+	int ret = 0;
+
+	if (res >= NR_PSI_RESOURCES)
+		return -EINVAL;
+
+#ifdef CONFIG_CGROUPS
+	if (cgroup_id) {
+		cgroup = cgroup_get_from_id(cgroup_id);
+		if (IS_ERR_OR_NULL(cgroup))
+			return PTR_ERR(cgroup);
+
+		group = cgroup_psi(cgroup);
+	} else
+#endif
+		group = &psi_system;
+
+	params.type = PSI_BPF;
+	params.bpf_psi = bpf_psi;
+	params.privileged = capable(CAP_SYS_RESOURCE);
+	params.res = res;
+	params.full = full;
+	params.threshold_us = threshold_us;
+	params.window_us = window_us;
+
+	t = psi_trigger_create(group, &params);
+	if (IS_ERR(t))
+		ret = PTR_ERR(t);
+	else
+		t->cgroup_id = cgroup_id;
+
+#ifdef CONFIG_CGROUPS
+	if (cgroup)
+		cgroup_put(cgroup);
+#endif
+
+	return ret;
+}
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_psi_kfuncs)
+BTF_ID_FLAGS(func, bpf_psi_create_trigger, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_psi_kfuncs)
+
+static const struct btf_kfunc_id_set bpf_psi_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_psi_kfuncs,
+};
+
 static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
 {
 	struct bpf_psi_ops *ops = kdata;
@@ -238,6 +315,13 @@ static int __init bpf_psi_struct_ops_init(void)
 	if (!bpf_psi_wq)
 		return -ENOMEM;
 
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					&bpf_psi_kfunc_set);
+	if (err) {
+		pr_warn("error while registering bpf psi kfuncs: %d", err);
+		goto err;
+	}
+
 	err = register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops);
 	if (err) {
 		pr_warn("error while registering bpf psi struct ops: %d", err);
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* [PATCH v1 14/14] bpf: selftests: psi struct ops test
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (12 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc Roman Gushchin
@ 2025-08-18 17:01 ` Roman Gushchin
  2025-08-19  4:08 ` [PATCH v1 00/14] mm: BPF OOM Suren Baghdasaryan
  2025-08-20 21:06 ` Shakeel Butt
  15 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-18 17:01 UTC (permalink / raw)
  To: linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Roman Gushchin

Add a psi struct ops test.

The test creates a cgroup with two child sub-cgroups, sets up
memory.high for one of them and puts a memory-hungry process
(initially frozen) into it.

Then it creates two psi triggers from within the init() bpf callback
and attaches them to these cgroups.  Then it deletes the first cgroup
and runs the memory-hungry task.  The task creates high memory
pressure, which fires the psi event.  The psi bpf handler declares
a memcg oom in the corresponding cgroup.  Finally the test checks that
both the handle_cgroup_free() and handle_psi_event() handlers were
executed, the correct process was killed and the oom counters were
updated.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 .../selftests/bpf/prog_tests/test_psi.c       | 224 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_psi.c  |  76 ++++++
 2 files changed, 300 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_psi.c b/tools/testing/selftests/bpf/prog_tests/test_psi.c
new file mode 100644
index 000000000000..4f3c91bd6606
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_psi.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "test_psi.skel.h"
+
+enum psi_res {
+	PSI_IO,
+	PSI_MEM,
+	PSI_CPU,
+	PSI_IRQ,
+	NR_PSI_RESOURCES,
+};
+
+struct cgroup_desc {
+	const char *path;
+	unsigned long long id;
+	int pid;
+	int fd;
+	size_t target;
+	size_t high;
+	bool victim;
+};
+
+#define MB (1024 * 1024)
+
+static struct cgroup_desc cgroups[] = {
+	{ .path = "/oom_test" },
+	{ .path = "/oom_test/cg1" },
+	{ .path = "/oom_test/cg2", .target = 500 * MB,
+	  .high = 40 * MB, .victim = true },
+};
+
+static int spawn_task(struct cgroup_desc *desc)
+{
+	char *ptr;
+	int pid;
+
+	pid = fork();
+	if (pid < 0)
+		return pid;
+
+	if (pid > 0) {
+		/* parent */
+		desc->pid = pid;
+		return 0;
+	}
+
+	/* child */
+	ptr = (char *)malloc(desc->target);
+	if (!ptr)
+		return -ENOMEM;
+
+	memset(ptr, 'a', desc->target);
+
+	while (1)
+		sleep(1000);
+
+	return 0;
+}
+
+static void setup_environment(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+		cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
+			goto cleanup;
+
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
+			goto cleanup;
+
+		/* Freeze the top-level cgroup and enable the memory controller */
+		if (i == 0) {
+			err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
+			if (!ASSERT_OK(err, "freeze cgroup"))
+				goto cleanup;
+
+			err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
+						"+memory");
+			if (!ASSERT_OK(err, "enable memory controller"))
+				goto cleanup;
+		}
+
+		/* Set memory.high */
+		if (cgroups[i].high) {
+			char buf[256];
+
+			snprintf(buf, sizeof(buf), "%lu", cgroups[i].high);
+			err = write_cgroup_file(cgroups[i].path, "memory.high", buf);
+			if (!ASSERT_OK(err, "set memory.high"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "0");
+			write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
+		}
+
+		/* Spawn tasks creating memory pressure */
+		if (cgroups[i].target) {
+			char buf[256];
+
+			err = spawn_task(&cgroups[i]);
+			if (!ASSERT_OK(err, "spawn task"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
+			err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
+			if (!ASSERT_OK(err, "put child into a cgroup"))
+				goto cleanup;
+		}
+	}
+
+	return;
+
+cleanup:
+	cleanup_cgroup_environment();
+}
+
+static int run_and_wait_for_oom(void)
+{
+	int ret = -1;
+	bool first = true;
+	char buf[4096] = {};
+	size_t size;
+
+	/* Unfreeze the top-level cgroup */
+	ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	if (!ASSERT_OK(ret, "unfreeze cgroup"))
+		return -1;
+
+	for (;;) {
+		int i, status;
+		pid_t pid = wait(&status);
+
+		if (pid == -1) {
+			if (errno == EINTR)
+				continue;
+			/* ECHILD */
+			break;
+		}
+
+		if (!first)
+			continue;
+		first = false;
+
+		/* Check which process was terminated first */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+			if (!ASSERT_OK(cgroups[i].victim !=
+				       (pid == cgroups[i].pid),
+				       "correct process was killed")) {
+				ret = -1;
+				break;
+			}
+
+			if (!cgroups[i].victim)
+				continue;
+
+			/* Check the memcg oom counter */
+			size = read_cgroup_file(cgroups[i].path, "memory.events",
+						buf, sizeof(buf));
+			if (!ASSERT_OK(size <= 0, "read memory.events")) {
+				ret = -1;
+				break;
+			}
+
+			if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
+				       "oom_kill count check")) {
+				ret = -1;
+				break;
+			}
+		}
+
+		/* Kill all remaining tasks */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++)
+			if (cgroups[i].pid && cgroups[i].pid != pid)
+				kill(cgroups[i].pid, SIGKILL);
+	}
+
+	return ret;
+}
+
+void test_psi(void)
+{
+	struct test_psi *skel;
+	u64 freed_cgroup_id;
+	int err;
+
+	setup_environment();
+
+	skel = test_psi__open_and_load();
+	err = libbpf_get_error(skel);
+	if (CHECK_FAIL(err))
+		goto cleanup;
+
+	skel->bss->deleted_cgroup_id = cgroups[1].id;
+	skel->bss->high_pressure_cgroup_id = cgroups[2].id;
+
+	err = test_psi__attach(skel);
+	if (CHECK_FAIL(err))
+		goto cleanup;
+
+	/* Delete the first cgroup, it should trigger handle_cgroup_free() */
+	remove_cgroup(cgroups[1].path);
+
+	/* Unfreeze all child tasks and create the memory pressure */
+	err = run_and_wait_for_oom();
+	CHECK_FAIL(err);
+
+	/* Check the result of the handle_cgroup_free() handler */
+	freed_cgroup_id = skel->bss->deleted_cgroup_id;
+	ASSERT_EQ(freed_cgroup_id, cgroups[1].id, "freed cgroup id");
+
+cleanup:
+	cleanup_cgroup_environment();
+	test_psi__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_psi.c b/tools/testing/selftests/bpf/progs/test_psi.c
new file mode 100644
index 000000000000..2c36c05a3065
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_psi.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
+void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;
+int bpf_out_of_memory(struct mem_cgroup *memcg, int order, bool wait_on_oom_lock,
+		      const char *constraint_text__nullable) __ksym;
+int bpf_psi_create_trigger(struct bpf_psi *bpf_psi, u64 cgroup_id,
+			   u32 res, u32 threshold_us, u32 window_us) __ksym;
+
+#define PSI_FULL 0x80000000
+
+/* cgroup which will experience the high memory pressure */
+u64 high_pressure_cgroup_id;
+
+/* cgroup which will be deleted */
+u64 deleted_cgroup_id;
+
+/* cgroup which was actually freed */
+u64 freed_cgroup_id;
+
+char constraint_name[] = "CONSTRAINT_BPF_PSI_MEM";
+
+SEC("struct_ops.s/init")
+int BPF_PROG(psi_init, struct bpf_psi *bpf_psi)
+{
+	int ret;
+
+	ret = bpf_psi_create_trigger(bpf_psi, high_pressure_cgroup_id,
+				     PSI_MEM | PSI_FULL, 100000, 1000000);
+	if (ret)
+		return ret;
+
+	return bpf_psi_create_trigger(bpf_psi, deleted_cgroup_id,
+				      PSI_IO, 100000, 1000000);
+}
+
+SEC("struct_ops.s/handle_psi_event")
+void BPF_PROG(handle_psi_event, struct psi_trigger *t)
+{
+	u64 cgroup_id = t->cgroup_id;
+	struct mem_cgroup *memcg;
+	struct cgroup *cgroup;
+
+	cgroup = bpf_cgroup_from_id(cgroup_id);
+	if (!cgroup)
+		return;
+
+	memcg = bpf_get_mem_cgroup(&cgroup->self);
+	if (!memcg) {
+		bpf_cgroup_release(cgroup);
+		return;
+	}
+
+	bpf_out_of_memory(memcg, 0, true, constraint_name);
+
+	bpf_put_mem_cgroup(memcg);
+	bpf_cgroup_release(cgroup);
+}
+
+SEC("struct_ops.s/handle_cgroup_free")
+void BPF_PROG(handle_cgroup_free, u64 cgroup_id)
+{
+	freed_cgroup_id = cgroup_id;
+}
+
+SEC(".struct_ops.link")
+struct bpf_psi_ops test_bpf_psi = {
+	.init = (void *)psi_init,
+	.handle_psi_event = (void *)handle_psi_event,
+	.handle_cgroup_free = (void *)handle_cgroup_free,
+};
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 00/14] mm: BPF OOM
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (13 preceding siblings ...)
  2025-08-18 17:01 ` [PATCH v1 14/14] bpf: selftests: psi struct ops test Roman Gushchin
@ 2025-08-19  4:08 ` Suren Baghdasaryan
  2025-08-19 19:52   ` Roman Gushchin
  2025-08-20 21:06 ` Shakeel Butt
  15 siblings, 1 reply; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-19  4:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> This patchset adds an ability to customize the out of memory
> handling using bpf.
>
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
>
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
>
> It provides a generic interface which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).
>
> The past attempt to implement memory-cgroup aware policy [2] showed
> that there are multiple opinions on what the best policy is.  As it's
> highly workload-dependent and specific to a concrete way of organizing
> workloads, the structure of the cgroup tree etc, a customizable
> bpf-based implementation is preferable over a in-kernel implementation
> with a dozen on sysctls.

s/on/of ?


>
> The second part is related to the fundamental question on when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite trashing and effective soft lockups.  In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls. In this scenario the userspace daemon was supposed to handle
> the majority of OOMs, while the in-kernel OOM killer worked as the
> last resort measure to guarantee that the system would never deadlock
> on the memory. But this approach creates additional infrastructure
> churn: userspace OOM daemon is a separate entity which needs to be
> deployed, updated, monitored. A completely different pipeline needs to
> be built to monitor both types of OOM events and collect associated
> logs. A userspace daemon is more restricted in terms on what data is
> available to it. Implementing a daemon which can work reliably under a
> heavy memory pressure in the system is also tricky.
>
> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> [3]: https://github.com/facebookincubator/oomd
> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
>
> ----
>
> v1:
>   1) Both OOM and PSI parts are now implemented using bpf struct ops,
>      providing a path the future extensions (suggested by Kumar Kartikeya Dwivedi,
>      Song Liu and Matt Bobrowski)
>   2) It's possible to create PSI triggers from BPF, no need for an additional
>      userspace agent. (suggested by Suren Baghdasaryan)
>      Also there is now a callback for the cgroup release event.
>   3) Added an ability to block on oom_lock instead of bailing out (suggested by Michal Hocko)
>   4) Added bpf_task_is_oom_victim (suggested by Michal Hocko)
>   5) PSI callbacks are scheduled using a separate workqueue (suggested by Suren Baghdasaryan)
>
> RFC:
>   https://lwn.net/ml/all/20250428033617.3797686-1-roman.gushchin@linux.dev/
>
>
> Roman Gushchin (14):
>   mm: introduce bpf struct ops for OOM handling
>   bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
>   mm: introduce bpf_oom_kill_process() bpf kfunc
>   mm: introduce bpf kfuncs to deal with memcg pointers
>   mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
>   mm: introduce bpf_out_of_memory() bpf kfunc
>   mm: allow specifying custom oom constraint for bpf triggers
>   mm: introduce bpf_task_is_oom_victim() kfunc
>   bpf: selftests: introduce read_cgroup_file() helper
>   bpf: selftests: bpf OOM handler test
>   sched: psi: refactor psi_trigger_create()
>   sched: psi: implement psi trigger handling using bpf
>   sched: psi: implement bpf_psi_create_trigger() kfunc
>   bpf: selftests: psi struct ops test
>
>  include/linux/bpf_oom.h                       |  49 +++
>  include/linux/bpf_psi.h                       |  71 ++++
>  include/linux/memcontrol.h                    |   2 +
>  include/linux/oom.h                           |  12 +
>  include/linux/psi.h                           |  15 +-
>  include/linux/psi_types.h                     |  72 +++-
>  kernel/bpf/verifier.c                         |   5 +
>  kernel/cgroup/cgroup.c                        |  14 +-
>  kernel/sched/bpf_psi.c                        | 337 ++++++++++++++++++
>  kernel/sched/build_utility.c                  |   4 +
>  kernel/sched/psi.c                            | 130 +++++--
>  mm/Makefile                                   |   4 +
>  mm/bpf_memcontrol.c                           | 166 +++++++++
>  mm/bpf_oom.c                                  | 157 ++++++++
>  mm/oom_kill.c                                 | 182 +++++++++-
>  tools/testing/selftests/bpf/cgroup_helpers.c  |  39 ++
>  tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
>  .../selftests/bpf/prog_tests/test_oom.c       | 229 ++++++++++++
>  .../selftests/bpf/prog_tests/test_psi.c       | 224 ++++++++++++
>  tools/testing/selftests/bpf/progs/test_oom.c  | 108 ++++++
>  tools/testing/selftests/bpf/progs/test_psi.c  |  76 ++++
>  21 files changed, 1845 insertions(+), 53 deletions(-)
>  create mode 100644 include/linux/bpf_oom.h
>  create mode 100644 include/linux/bpf_psi.h
>  create mode 100644 kernel/sched/bpf_psi.c
>  create mode 100644 mm/bpf_memcontrol.c
>  create mode 100644 mm/bpf_oom.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c
>
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-18 17:01 ` [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling Roman Gushchin
@ 2025-08-19  4:09   ` Suren Baghdasaryan
  2025-08-19 20:06     ` Roman Gushchin
  2025-08-20 11:28   ` Kumar Kartikeya Dwivedi
  2025-08-26 16:56   ` Amery Hung
  2 siblings, 1 reply; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-19  4:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Introduce a bpf struct ops for implementing custom OOM handling policies.
>
> The struct ops provides the bpf_handle_out_of_memory() callback,
> which expected to return 1 if it was able to free some memory and 0
> otherwise.
>
> In the latter case it's guaranteed that the in-kernel OOM killer will
> be invoked. Otherwise the kernel also checks the bpf_memory_freed
> field of the oom_control structure, which is expected to be set by
> kfuncs suitable for releasing memory. It's a safety mechanism which
> prevents a bpf program to claim forward progress without actually
> releasing memory. The callback program is sleepable to enable using
> iterators, e.g. cgroup iterators.
>
> The callback receives struct oom_control as an argument, so it can
> easily filter out OOM's it doesn't want to handle, e.g. global vs
> memcg OOM's.
>
> The callback is executed just before the kernel victim task selection
> algorithm, so all heuristics and sysctls like panic on oom,
> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
> are respected.
>
> The struct ops also has the name field, which allows to define a
> custom name for the implemented policy. It's printed in the OOM report
> in the oom_policy=<policy> format. "default" is printed if bpf is not
> used or policy name is not specified.
>
> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>                oom_policy=bpf_test_policy
> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> [  112.698167] Call Trace:
> [  112.698177]  <TASK>
> [  112.698182]  dump_stack_lvl+0x4d/0x70
> [  112.698192]  dump_header+0x59/0x1c6
> [  112.698199]  oom_kill_process.cold+0x8/0xef
> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
> [  112.698250]  out_of_memory+0xab/0x5c0
> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
> [  112.698288]  charge_memcg+0x2f/0xc0
> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
> [  112.698299]  do_anonymous_page+0x40f/0xa50
> [  112.698311]  __handle_mm_fault+0xbba/0x1140
> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  112.698335]  handle_mm_fault+0xe6/0x370
> [  112.698343]  do_user_addr_fault+0x211/0x6a0
> [  112.698354]  exc_page_fault+0x75/0x1d0
> [  112.698363]  asm_exc_page_fault+0x26/0x30
> [  112.698366] RIP: 0033:0x7fa97236db00
>
> It's possible to load multiple bpf struct programs. In the case of
> oom, they will be executed one by one in the same order they been
> loaded until one of them returns 1 and bpf_memory_freed is set to 1
> - an indication that the memory was freed. This allows to have
> multiple bpf programs to focus on different types of OOM's - e.g.
> one program can only handle memcg OOM's in one memory cgroup.
> But the filtering is done in bpf - so it's fully flexible.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/bpf_oom.h |  49 +++++++++++++
>  include/linux/oom.h     |   8 ++
>  mm/Makefile             |   3 +
>  mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
>  mm/oom_kill.c           |  22 +++++-
>  5 files changed, 237 insertions(+), 2 deletions(-)
>  create mode 100644 include/linux/bpf_oom.h
>  create mode 100644 mm/bpf_oom.c
>
> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
> new file mode 100644
> index 000000000000..29cb5ea41d97
> --- /dev/null
> +++ b/include/linux/bpf_oom.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_OOM_H
> +#define __BPF_OOM_H
> +
> +struct bpf_oom;
> +struct oom_control;
> +
> +#define BPF_OOM_NAME_MAX_LEN 64
> +
> +struct bpf_oom_ops {
> +       /**
> +        * @handle_out_of_memory: Out of memory bpf handler, called before
> +        * the in-kernel OOM killer.
> +        * @oc: OOM control structure
> +        *
> +        * Should return 1 if some memory was freed up, otherwise
> +        * the in-kernel OOM killer is invoked.
> +        */
> +       int (*handle_out_of_memory)(struct oom_control *oc);
> +
> +       /**
> +        * @name: BPF OOM policy name
> +        */
> +       char name[BPF_OOM_NAME_MAX_LEN];

Why should the name be a part of ops structure? IMO it's not an
attribute of the operations but rather of the oom handler which is
represented by bpf_oom here.

> +
> +       /* Private */
> +       struct bpf_oom *bpf_oom;
> +};
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +/**
> + * @bpf_handle_oom: handle out of memory using bpf programs
> + * @oc: OOM control structure
> + *
> + * Returns true if a bpf oom program was executed, returned 1
> + * and some memory was actually freed.

The above comment is unclear, please clarify.

> + */
> +bool bpf_handle_oom(struct oom_control *oc);
> +
> +#else /* CONFIG_BPF_SYSCALL */
> +static inline bool bpf_handle_oom(struct oom_control *oc)
> +{
> +       return false;
> +}
> +
> +#endif /* CONFIG_BPF_SYSCALL */
> +
> +#endif /* __BPF_OOM_H */
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 1e0fc6931ce9..ef453309b7ea 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -51,6 +51,14 @@ struct oom_control {
>
>         /* Used to print the constraint info. */
>         enum oom_constraint constraint;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +       /* Used by the bpf oom implementation to mark the forward progress */
> +       bool bpf_memory_freed;
> +
> +       /* Policy name */
> +       const char *bpf_policy_name;
> +#endif
>  };
>
>  extern struct mutex oom_lock;
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..a714aba03759 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>  ifdef CONFIG_SWAP
>  obj-$(CONFIG_MEMCG) += swap_cgroup.o
>  endif
> +ifdef CONFIG_BPF_SYSCALL
> +obj-y += bpf_oom.o
> +endif
>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>  obj-$(CONFIG_GUP_TEST) += gup_test.o
>  obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
> new file mode 100644
> index 000000000000..47633046819c
> --- /dev/null
> +++ b/mm/bpf_oom.c
> @@ -0,0 +1,157 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * BPF-driven OOM killer customization
> + *
> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/oom.h>
> +#include <linux/bpf_oom.h>
> +#include <linux/srcu.h>
> +
> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
> +static DEFINE_SPINLOCK(bpf_oom_lock);
> +static LIST_HEAD(bpf_oom_handlers);
> +
> +struct bpf_oom {

Perhaps bpf_oom_handler ? Then bpf_oom_ops->bpf_oom could be called
bpf_oom_ops->handler.


> +       struct bpf_oom_ops *ops;
> +       struct list_head node;
> +       struct srcu_struct srcu;
> +};
> +
> +bool bpf_handle_oom(struct oom_control *oc)
> +{
> +       struct bpf_oom_ops *ops;
> +       struct bpf_oom *bpf_oom;
> +       int list_idx, idx, ret = 0;
> +
> +       oc->bpf_memory_freed = false;
> +
> +       list_idx = srcu_read_lock(&bpf_oom_srcu);
> +       list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
> +               ops = READ_ONCE(bpf_oom->ops);
> +               if (!ops || !ops->handle_out_of_memory)
> +                       continue;
> +               idx = srcu_read_lock(&bpf_oom->srcu);
> +               oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
> +                       "bpf_defined_policy";
> +               ret = ops->handle_out_of_memory(oc);
> +               oc->bpf_policy_name = NULL;
> +               srcu_read_unlock(&bpf_oom->srcu, idx);
> +
> +               if (ret && oc->bpf_memory_freed)

IIUC ret and oc->bpf_memory_freed seem to reflect the same state:
handler successfully freed some memory. Could you please clarify when
they differ?



> +                       break;
> +       }
> +       srcu_read_unlock(&bpf_oom_srcu, list_idx);
> +
> +       return ret && oc->bpf_memory_freed;
> +}
> +
> +static int __handle_out_of_memory(struct oom_control *oc)
> +{
> +       return 0;
> +}
> +
> +static struct bpf_oom_ops __bpf_oom_ops = {
> +       .handle_out_of_memory = __handle_out_of_memory,
> +};
> +
> +static const struct bpf_func_proto *
> +bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +       return tracing_prog_func_proto(func_id, prog);
> +}
> +
> +static bool bpf_oom_ops_is_valid_access(int off, int size,
> +                                       enum bpf_access_type type,
> +                                       const struct bpf_prog *prog,
> +                                       struct bpf_insn_access_aux *info)
> +{
> +       return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
> +       .get_func_proto = bpf_oom_func_proto,
> +       .is_valid_access = bpf_oom_ops_is_valid_access,
> +};
> +
> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
> +{
> +       struct bpf_oom_ops *ops = kdata;
> +       struct bpf_oom *bpf_oom;
> +       int ret;
> +
> +       bpf_oom = kmalloc(sizeof(*bpf_oom), GFP_KERNEL_ACCOUNT);
> +       if (!bpf_oom)
> +               return -ENOMEM;
> +
> +       ret = init_srcu_struct(&bpf_oom->srcu);
> +       if (ret) {
> +               kfree(bpf_oom);
> +               return ret;
> +       }
> +
> +       WRITE_ONCE(bpf_oom->ops, ops);
> +       ops->bpf_oom = bpf_oom;
> +
> +       spin_lock(&bpf_oom_lock);
> +       list_add_rcu(&bpf_oom->node, &bpf_oom_handlers);
> +       spin_unlock(&bpf_oom_lock);
> +
> +       return 0;
> +}
> +
> +static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +       struct bpf_oom_ops *ops = kdata;
> +       struct bpf_oom *bpf_oom = ops->bpf_oom;
> +
> +       WRITE_ONCE(bpf_oom->ops, NULL);
> +
> +       spin_lock(&bpf_oom_lock);
> +       list_del_rcu(&bpf_oom->node);
> +       spin_unlock(&bpf_oom_lock);
> +
> +       synchronize_srcu(&bpf_oom->srcu);
> +
> +       kfree(bpf_oom);
> +}
> +
> +static int bpf_oom_ops_init_member(const struct btf_type *t,
> +                                  const struct btf_member *member,
> +                                  void *kdata, const void *udata)
> +{
> +       const struct bpf_oom_ops *uops = (const struct bpf_oom_ops *)udata;
> +       struct bpf_oom_ops *ops = (struct bpf_oom_ops *)kdata;
> +       u32 moff = __btf_member_bit_offset(t, member) / 8;
> +
> +       switch (moff) {
> +       case offsetof(struct bpf_oom_ops, name):
> +               strscpy_pad(ops->name, uops->name, sizeof(ops->name));
> +               return 1;
> +       }
> +       return 0;
> +}
> +
> +static int bpf_oom_ops_init(struct btf *btf)
> +{
> +       return 0;
> +}
> +
> +static struct bpf_struct_ops bpf_oom_bpf_ops = {
> +       .verifier_ops = &bpf_oom_verifier_ops,
> +       .reg = bpf_oom_ops_reg,
> +       .unreg = bpf_oom_ops_unreg,
> +       .init_member = bpf_oom_ops_init_member,
> +       .init = bpf_oom_ops_init,
> +       .name = "bpf_oom_ops",
> +       .owner = THIS_MODULE,
> +       .cfi_stubs = &__bpf_oom_ops
> +};
> +
> +static int __init bpf_oom_struct_ops_init(void)
> +{
> +       return register_bpf_struct_ops(&bpf_oom_bpf_ops, bpf_oom_ops);
> +}
> +late_initcall(bpf_oom_struct_ops_init);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..ad7bd65061d6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -45,6 +45,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/cred.h>
>  #include <linux/nmi.h>
> +#include <linux/bpf_oom.h>
>
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -246,6 +247,15 @@ static const char * const oom_constraint_text[] = {
>         [CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
>  };
>
> +static const char *oom_policy_name(struct oom_control *oc)
> +{
> +#ifdef CONFIG_BPF_SYSCALL
> +       if (oc->bpf_policy_name)
> +               return oc->bpf_policy_name;
> +#endif
> +       return "default";
> +}
> +
>  /*
>   * Determine the type of allocation constraint.
>   */
> @@ -458,9 +468,10 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
>
>  static void dump_header(struct oom_control *oc)
>  {
> -       pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
> +       pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\noom_policy=%s\n",
>                 current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
> -                       current->signal->oom_score_adj);
> +               current->signal->oom_score_adj,
> +               oom_policy_name(oc));
>         if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
>                 pr_warn("COMPACTION is disabled!!!\n");
>
> @@ -1161,6 +1172,13 @@ bool out_of_memory(struct oom_control *oc)
>                 return true;
>         }
>
> +       /*
> +        * Let bpf handle the OOM first. If it was able to free up some memory,
> +        * bail out. Otherwise fall back to the kernel OOM killer.
> +        */
> +       if (bpf_handle_oom(oc))
> +               return true;
> +
>         select_bad_process(oc);
>         /* Found nothing?!?! */
>         if (!oc->chosen) {
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 06/14] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-08-18 17:01 ` [PATCH v1 06/14] mm: introduce bpf_out_of_memory() " Roman Gushchin
@ 2025-08-19  4:09   ` Suren Baghdasaryan
  2025-08-19 20:16     ` Roman Gushchin
  2025-08-20  9:34   ` Kumar Kartikeya Dwivedi
  1 sibling, 1 reply; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-19  4:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Mon, Aug 18, 2025 at 10:02 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> an out of memory events and trigger the corresponding kernel OOM
> handling mechanism.
>
> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> as an argument, as well as the page order.
>
> If the wait_on_oom_lock argument is not set, only one OOM can be
> declared and handled in the system at once, so if the function is
> called in parallel to another OOM handling, it bails out with -EBUSY.
> This mode is suited for global OOM's: any concurrent OOMs will likely
> do the job and release some memory. In a blocking mode (which is
> suited for memcg OOMs) the execution will wait on the oom_lock mutex.
>
> The function is declared as sleepable. It guarantees that it won't
> be called from an atomic context. It's required by the OOM handling
> code, which is not guaranteed to work in a non-blocking context.
>
> Handling of a memcg OOM almost always requires taking of the
> css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable
> also guarantees that it can't be called with acquired css_set_lock,
> so the kernel can't deadlock on it.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  mm/oom_kill.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25fc5e744e27..df409f0fac45 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1324,10 +1324,55 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>         return 0;
>  }
>
> +/**
> + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
> + * @memcg__nullable: memcg or NULL for system-wide OOMs
> + * @order: order of page which wasn't allocated
> + * @wait_on_oom_lock: if true, block on oom_lock
> + * @constraint_text__nullable: custom constraint description for the OOM report
> + *
> + * Declares the Out Of Memory state and invokes the OOM killer.
> + *
> + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
> + * is true, the function will wait on it. Otherwise it bails out with -EBUSY
> + * if oom_lock is contended.
> + *
> + * Generally it's advised to pass wait_on_oom_lock=true for global OOMs
> + * and wait_on_oom_lock=false for memcg-scoped OOMs.

From the changelog description I was under the impression that it's vice
versa: for global OOMs you would not block (wait_on_oom_lock=false),
for memcg ones you would (wait_on_oom_lock=true).

> + *
> + * Returns 1 if the forward progress was achieved and some memory was freed.
> + * Returns a negative value if an error has been occurred.

s/has been occurred/has occurred/ (or simply "occurred")


> + */
> +__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> +                                 int order, bool wait_on_oom_lock)
> +{
> +       struct oom_control oc = {
> +               .memcg = memcg__nullable,
> +               .order = order,
> +       };
> +       int ret;
> +
> +       if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
> +               return -EINVAL;
> +
> +       if (wait_on_oom_lock) {
> +               ret = mutex_lock_killable(&oom_lock);
> +               if (ret)
> +                       return ret;
> +       } else if (!mutex_trylock(&oom_lock))
> +               return -EBUSY;
> +
> +       ret = out_of_memory(&oc);
> +
> +       mutex_unlock(&oom_lock);
> +       return ret;
> +}
> +
>  __bpf_kfunc_end_defs();
>
>  BTF_KFUNCS_START(bpf_oom_kfuncs)
>  BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS)
>  BTF_KFUNCS_END(bpf_oom_kfuncs)
>
>  static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 11/14] sched: psi: refactor psi_trigger_create()
  2025-08-18 17:01 ` [PATCH v1 11/14] sched: psi: refactor psi_trigger_create() Roman Gushchin
@ 2025-08-19  4:09   ` Suren Baghdasaryan
  2025-08-19 20:28     ` Roman Gushchin
  0 siblings, 1 reply; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-19  4:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Mon, Aug 18, 2025 at 10:02 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Currently psi_trigger_create() does a lot of things:
> parses the user text input, allocates and initializes
> the psi_trigger structure and turns on the trigger.
> It does it slightly different for two existing types
> of psi_triggers: system-wide and cgroup-wide.
>
> In order to support a new type of psi triggers, which
> will be owned by a bpf program and won't have a user's
> text description, let's refactor psi_trigger_create().
>
> 1. Introduce psi_trigger_type enum:
>    currently PSI_SYSTEM and PSI_CGROUP are valid values.
> 2. Introduce psi_trigger_params structure to avoid passing
>    a large number of parameters to psi_trigger_create().
> 3. Move out the user's input parsing into the new
>    psi_trigger_parse() helper.
> 4. Move out the capabilities check into the new
>    psi_file_privileged() helper.
> 5. Stop relying on t->of for detecting trigger type.

It's worth noting that this is a pure core refactoring without any
functional change (hopefully :))

>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/psi.h       | 15 +++++--
>  include/linux/psi_types.h | 33 ++++++++++++++-
>  kernel/cgroup/cgroup.c    | 14 ++++++-
>  kernel/sched/psi.c        | 87 +++++++++++++++++++++++++--------------
>  4 files changed, 112 insertions(+), 37 deletions(-)
>
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index e0745873e3f2..8178e998d94b 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -23,14 +23,23 @@ void psi_memstall_enter(unsigned long *flags);
>  void psi_memstall_leave(unsigned long *flags);
>
>  int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
> -struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
> -                                      enum psi_res res, struct file *file,
> -                                      struct kernfs_open_file *of);
> +int psi_trigger_parse(struct psi_trigger_params *params, const char *buf);
> +struct psi_trigger *psi_trigger_create(struct psi_group *group,
> +                               const struct psi_trigger_params *param);
>  void psi_trigger_destroy(struct psi_trigger *t);
>
>  __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
>                         poll_table *wait);
>
> +static inline bool psi_file_privileged(struct file *file)
> +{
> +       /*
> +        * Checking the privilege here on file->f_cred implies that a privileged user
> +        * could open the file and delegate the write to an unprivileged one.
> +        */
> +       return cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
> +}
> +
>  #ifdef CONFIG_CGROUPS
>  static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
>  {
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index f1fd3a8044e0..cea54121d9b9 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -121,7 +121,38 @@ struct psi_window {
>         u64 prev_growth;
>  };
>
> +enum psi_trigger_type {
> +       PSI_SYSTEM,
> +       PSI_CGROUP,
> +};
> +
> +struct psi_trigger_params {
> +       /* Trigger type */
> +       enum psi_trigger_type type;
> +
> +       /* Resources that workloads could be stalled on */

I would describe this as "Resource to be monitored"

> +       enum psi_res res;
> +
> +       /* True if all threads should be stalled to trigger */
> +       bool full;
> +
> +       /* Threshold in us */
> +       u32 threshold_us;
> +
> +       /* Window in us */
> +       u32 window_us;
> +
> +       /* Privileged triggers are treated differently */
> +       bool privileged;
> +
> +       /* Link to kernfs open file, only for PSI_CGROUP */
> +       struct kernfs_open_file *of;
> +};
> +
>  struct psi_trigger {
> +       /* Trigger type */
> +       enum psi_trigger_type type;
> +
>         /* PSI state being monitored by the trigger */
>         enum psi_states state;
>
> @@ -137,7 +168,7 @@ struct psi_trigger {
>         /* Wait queue for polling */
>         wait_queue_head_t event_wait;
>
> -       /* Kernfs file for cgroup triggers */
> +       /* Kernfs file for PSI_CGROUP triggers */
>         struct kernfs_open_file *of;
>
>         /* Pending event flag */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index a723b7dc6e4e..9cd3c3a52c21 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -3872,6 +3872,12 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
>         struct psi_trigger *new;
>         struct cgroup *cgrp;
>         struct psi_group *psi;
> +       struct psi_trigger_params params;
> +       int err;
> +
> +       err = psi_trigger_parse(&params, buf);
> +       if (err)
> +               return err;
>
>         cgrp = cgroup_kn_lock_live(of->kn, false);
>         if (!cgrp)
> @@ -3887,7 +3893,13 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
>         }
>
>         psi = cgroup_psi(cgrp);
> -       new = psi_trigger_create(psi, buf, res, of->file, of);
> +
> +       params.type = PSI_CGROUP;
> +       params.res = res;
> +       params.privileged = psi_file_privileged(of->file);
> +       params.of = of;
> +
> +       new = psi_trigger_create(psi, &params);
>         if (IS_ERR(new)) {
>                 cgroup_put(cgrp);
>                 return PTR_ERR(new);
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index ad04a5c3162a..e1d8eaeeff17 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -489,7 +489,7 @@ static void update_triggers(struct psi_group *group, u64 now,
>
>                 /* Generate an event */
>                 if (cmpxchg(&t->event, 0, 1) == 0) {
> -                       if (t->of)
> +                       if (t->type == PSI_CGROUP)
>                                 kernfs_notify(t->of->kn);
>                         else
>                                 wake_up_interruptible(&t->event_wait);
> @@ -1281,74 +1281,87 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
>         return 0;
>  }
>
> -struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
> -                                      enum psi_res res, struct file *file,
> -                                      struct kernfs_open_file *of)
> +int psi_trigger_parse(struct psi_trigger_params *params, const char *buf)
>  {
> -       struct psi_trigger *t;
> -       enum psi_states state;
> -       u32 threshold_us;
> -       bool privileged;
> -       u32 window_us;
> +       u32 threshold_us, window_us;
>
>         if (static_branch_likely(&psi_disabled))
> -               return ERR_PTR(-EOPNOTSUPP);
> -
> -       /*
> -        * Checking the privilege here on file->f_cred implies that a privileged user
> -        * could open the file and delegate the write to an unprivileged one.
> -        */
> -       privileged = cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
> +               return -EOPNOTSUPP;
>
>         if (sscanf(buf, "some %u %u", &threshold_us, &window_us) == 2)
> -               state = PSI_IO_SOME + res * 2;
> +               params->full = false;
>         else if (sscanf(buf, "full %u %u", &threshold_us, &window_us) == 2)
> -               state = PSI_IO_FULL + res * 2;
> +               params->full = true;
>         else
> -               return ERR_PTR(-EINVAL);
> +               return -EINVAL;
> +
> +       params->threshold_us = threshold_us;
> +       params->window_us = window_us;
> +       return 0;
> +}
> +
> +struct psi_trigger *psi_trigger_create(struct psi_group *group,
> +                                      const struct psi_trigger_params *params)
> +{
> +       struct psi_trigger *t;
> +       enum psi_states state;
> +
> +       if (static_branch_likely(&psi_disabled))
> +               return ERR_PTR(-EOPNOTSUPP);
> +
> +       state = params->full ? PSI_IO_FULL : PSI_IO_SOME;
> +       state += params->res * 2;
>
>  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> -       if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
> +       if (params->res == PSI_IRQ && --state != PSI_IRQ_FULL)
>                 return ERR_PTR(-EINVAL);
>  #endif
>
>         if (state >= PSI_NONIDLE)
>                 return ERR_PTR(-EINVAL);
>
> -       if (window_us == 0 || window_us > WINDOW_MAX_US)
> +       if (params->window_us == 0 || params->window_us > WINDOW_MAX_US)
>                 return ERR_PTR(-EINVAL);
>
>         /*
>          * Unprivileged users can only use 2s windows so that averages aggregation
>          * work is used, and no RT threads need to be spawned.
>          */
> -       if (!privileged && window_us % 2000000)
> +       if (!params->privileged && params->window_us % 2000000)
>                 return ERR_PTR(-EINVAL);
>
>         /* Check threshold */
> -       if (threshold_us == 0 || threshold_us > window_us)
> +       if (params->threshold_us == 0 || params->threshold_us > params->window_us)
>                 return ERR_PTR(-EINVAL);
>
>         t = kmalloc(sizeof(*t), GFP_KERNEL);
>         if (!t)
>                 return ERR_PTR(-ENOMEM);
>
> +       t->type = params->type;
>         t->group = group;
>         t->state = state;
> -       t->threshold = threshold_us * NSEC_PER_USEC;
> -       t->win.size = window_us * NSEC_PER_USEC;
> +       t->threshold = params->threshold_us * NSEC_PER_USEC;
> +       t->win.size = params->window_us * NSEC_PER_USEC;
>         window_reset(&t->win, sched_clock(),
>                         group->total[PSI_POLL][t->state], 0);
>
>         t->event = 0;
>         t->last_event_time = 0;
> -       t->of = of;
> -       if (!of)
> +
> +       switch (params->type) {
> +       case PSI_SYSTEM:
>                 init_waitqueue_head(&t->event_wait);

I think t->of will be left uninitialized here. Let's set it to NULL please.
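E.g. something like this (untested, just to illustrate the suggestion):

	case PSI_SYSTEM:
		t->of = NULL;
		init_waitqueue_head(&t->event_wait);
		break;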


> +               break;
> +       case PSI_CGROUP:
> +               t->of = params->of;
> +               break;
> +       }
> +
>         t->pending_event = false;
> -       t->aggregator = privileged ? PSI_POLL : PSI_AVGS;
> +       t->aggregator = params->privileged ? PSI_POLL : PSI_AVGS;
>
> -       if (privileged) {
> +       if (params->privileged) {
>                 mutex_lock(&group->rtpoll_trigger_lock);
>
>                 if (!rcu_access_pointer(group->rtpoll_task)) {
> @@ -1401,7 +1414,7 @@ void psi_trigger_destroy(struct psi_trigger *t)
>          * being accessed later. Can happen if cgroup is deleted from under a
>          * polling process.
>          */
> -       if (t->of)
> +       if (t->type == PSI_CGROUP)
>                 kernfs_notify(t->of->kn);
>         else
>                 wake_up_interruptible(&t->event_wait);
> @@ -1481,7 +1494,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
>         if (!t)
>                 return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
>
> -       if (t->of)
> +       if (t->type == PSI_CGROUP)
>                 kernfs_generic_poll(t->of, wait);
>         else
>                 poll_wait(file, &t->event_wait, wait);
> @@ -1530,6 +1543,8 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
>         size_t buf_size;
>         struct seq_file *seq;
>         struct psi_trigger *new;
> +       struct psi_trigger_params params;
> +       int err;
>
>         if (static_branch_likely(&psi_disabled))
>                 return -EOPNOTSUPP;
> @@ -1543,6 +1558,10 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
>
>         buf[buf_size - 1] = '\0';
>
> +       err = psi_trigger_parse(&params, buf);
> +       if (err)
> +               return err;
> +
>         seq = file->private_data;
>
>         /* Take seq->lock to protect seq->private from concurrent writes */
> @@ -1554,7 +1573,11 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
>                 return -EBUSY;
>         }
>
> -       new = psi_trigger_create(&psi_system, buf, res, file, NULL);
> +       params.type = PSI_SYSTEM;
> +       params.res = res;
> +       params.privileged = psi_file_privileged(file);
> +
> +       new = psi_trigger_create(&psi_system, &params);
>         if (IS_ERR(new)) {
>                 mutex_unlock(&seq->lock);
>                 return PTR_ERR(new);
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf
  2025-08-18 17:01 ` [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf Roman Gushchin
@ 2025-08-19  4:11   ` Suren Baghdasaryan
  2025-08-19 22:31     ` Roman Gushchin
  2025-08-26 17:03   ` Amery Hung
  1 sibling, 1 reply; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-19  4:11 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Mon, Aug 18, 2025 at 10:02 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> This patch implements a bpf struct ops-based mechanism to create
> psi triggers, attach them to cgroups or system wide and handle
> psi events in bpf.
>
> The struct ops provides 3 callbacks:
>   - init() called once at load, handy for creating psi triggers
>   - handle_psi_event() called every time a psi trigger fires
>   - handle_cgroup_free() called if a cgroup with an attached
>     trigger is being freed
>
> A single struct ops can create a number of psi triggers, both
> cgroup-scoped and system-wide.
>
> All 3 struct ops callbacks can be sleepable. handle_psi_event()
> handlers are executed using a separate workqueue, so it won't
> affect the latency of other psi triggers.

I'll need to stare at this code some more, but overall it makes sense
to me. Some early comments below.

>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/bpf_psi.h      |  71 ++++++++++
>  include/linux/psi_types.h    |  43 +++++-
>  kernel/sched/bpf_psi.c       | 253 +++++++++++++++++++++++++++++++++++
>  kernel/sched/build_utility.c |   4 +
>  kernel/sched/psi.c           |  49 +++++--
>  5 files changed, 408 insertions(+), 12 deletions(-)
>  create mode 100644 include/linux/bpf_psi.h
>  create mode 100644 kernel/sched/bpf_psi.c
>
> diff --git a/include/linux/bpf_psi.h b/include/linux/bpf_psi.h
> new file mode 100644
> index 000000000000..826ab89ac11c
> --- /dev/null
> +++ b/include/linux/bpf_psi.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_PSI_H
> +#define __BPF_PSI_H
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/srcu.h>
> +#include <linux/psi_types.h>
> +
> +struct cgroup;
> +struct bpf_psi;
> +struct psi_trigger;
> +struct psi_trigger_params;
> +
> +#define BPF_PSI_FULL 0x80000000
> +
> +struct bpf_psi_ops {
> +       /**
> +        * @init: Initialization callback, suited for creating psi triggers.
> +        * @bpf_psi: bpf_psi pointer, can be passed to bpf_psi_create_trigger().
> +        *
> +        * A non-0 return value means the initialization has been failed.
> +        */
> +       int (*init)(struct bpf_psi *bpf_psi);
> +
> +       /**
> +        * @handle_psi_event: PSI event callback
> +        * @t: psi_trigger pointer
> +        */
> +       void (*handle_psi_event)(struct psi_trigger *t);
> +
> +       /**
> +        * @handle_cgroup_free: Cgroup free callback
> +        * @cgroup_id: Id of freed cgroup
> +        *
> +        * Called every time a cgroup with an attached bpf psi trigger is freed.
> +        * No psi events can be raised after handle_cgroup_free().
> +        */
> +       void (*handle_cgroup_free)(u64 cgroup_id);
> +
> +       /* private */
> +       struct bpf_psi *bpf_psi;
> +};
> +
> +struct bpf_psi {
> +       spinlock_t lock;
> +       struct list_head triggers;
> +       struct bpf_psi_ops *ops;
> +       struct srcu_struct srcu;
> +};
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +void bpf_psi_add_trigger(struct psi_trigger *t,
> +                        const struct psi_trigger_params *params);
> +void bpf_psi_remove_trigger(struct psi_trigger *t);
> +void bpf_psi_handle_event(struct psi_trigger *t);
> +#ifdef CONFIG_CGROUPS
> +void bpf_psi_cgroup_free(struct cgroup *cgroup);
> +#endif
> +
> +#else /* CONFIG_BPF_SYSCALL */
> +static inline void bpf_psi_add_trigger(struct psi_trigger *t,
> +                       const struct psi_trigger_params *params) {}
> +static inline void bpf_psi_remove_trigger(struct psi_trigger *t) {}
> +static inline void bpf_psi_handle_event(struct psi_trigger *t) {}
> +static inline void bpf_psi_cgroup_free(struct cgroup *cgroup) {}
> +
> +#endif /* CONFIG_BPF_SYSCALL */
> +
> +#endif /* __BPF_PSI_H */
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index cea54121d9b9..f695cc34cfd4 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -124,6 +124,7 @@ struct psi_window {
>  enum psi_trigger_type {
>         PSI_SYSTEM,
>         PSI_CGROUP,
> +       PSI_BPF,
>  };
>
>  struct psi_trigger_params {
> @@ -145,8 +146,15 @@ struct psi_trigger_params {
>         /* Privileged triggers are treated differently */
>         bool privileged;
>
> -       /* Link to kernfs open file, only for PSI_CGROUP */
> -       struct kernfs_open_file *of;
> +       union {
> +               /* Link to kernfs open file, only for PSI_CGROUP */
> +               struct kernfs_open_file *of;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +               /* Link to bpf_psi structure, only for BPF_PSI */
> +               struct bpf_psi *bpf_psi;
> +#endif
> +       };
>  };
>
>  struct psi_trigger {
> @@ -188,6 +196,31 @@ struct psi_trigger {
>
>         /* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
>         enum psi_aggregators aggregator;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +       /* Fields specific to PSI_BPF triggers */
> +
> +       /* Bpf psi structure for events handling */
> +       struct bpf_psi *bpf_psi;
> +
> +       /* List node inside bpf_psi->triggers list */
> +       struct list_head bpf_psi_node;
> +
> +       /* List node inside group->bpf_triggers list */
> +       struct list_head bpf_group_node;
> +
> +       /* Work structure, used to execute event handlers */
> +       struct work_struct bpf_work;

I think bpf_work can be moved into struct bpf_psi, as you are using it
to get to bpf_psi anyway:

       t = container_of(work, struct psi_trigger, bpf_work);
       bpf_psi = READ_ONCE(t->bpf_psi);

> +
> +       /*
> +        * Whether the trigger is being pinned in memory.
> +        * Protected by group->bpf_triggers_lock.
> +        */
> +       bool pinned;

Same with the pinned field. I think you are using it only with triggers
which have a valid t->bpf_work, so it might as well move in there too. I
would also call this field "isolated" rather than "pinned", but that's
just a preference.

> +
> +       /* Cgroup Id */
> +       u64 cgroup_id;

This cgroup_id field is weird. It's not initialized and not used here,
then it gets initialized in the next patch and used in the last patch
from a selftest. This is quite confusing. Also logically I don't think
a cgroup attribute really belongs to psi_trigger... Can we at least
move it into bpf_psi where it might fit a bit better?

> +#endif
>  };
>
>  struct psi_group {
> @@ -236,6 +269,12 @@ struct psi_group {
>         u64 rtpoll_total[NR_PSI_STATES - 1];
>         u64 rtpoll_next_update;
>         u64 rtpoll_until;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +       /* List of triggers owned by bpf and corresponding lock */
> +       spinlock_t bpf_triggers_lock;
> +       struct list_head bpf_triggers;
> +#endif
>  };
>
>  #else /* CONFIG_PSI */
> diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
> new file mode 100644
> index 000000000000..2ea9d7276b21
> --- /dev/null
> +++ b/kernel/sched/bpf_psi.c
> @@ -0,0 +1,253 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * BPF PSI event handlers
> + *
> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> + */
> +
> +#include <linux/bpf_psi.h>
> +#include <linux/cgroup-defs.h>
> +
> +static struct workqueue_struct *bpf_psi_wq;
> +
> +static struct bpf_psi *bpf_psi_create(struct bpf_psi_ops *ops)
> +{
> +       struct bpf_psi *bpf_psi;
> +
> +       bpf_psi = kzalloc(sizeof(*bpf_psi), GFP_KERNEL);
> +       if (!bpf_psi)
> +               return NULL;
> +
> +       if (init_srcu_struct(&bpf_psi->srcu)) {
> +               kfree(bpf_psi);
> +               return NULL;
> +       }
> +
> +       spin_lock_init(&bpf_psi->lock);
> +       bpf_psi->ops = ops;
> +       INIT_LIST_HEAD(&bpf_psi->triggers);
> +       ops->bpf_psi = bpf_psi;
> +
> +       return bpf_psi;
> +}
> +
> +static void bpf_psi_free(struct bpf_psi *bpf_psi)
> +{
> +       cleanup_srcu_struct(&bpf_psi->srcu);
> +       kfree(bpf_psi);
> +}
> +
> +static void bpf_psi_handle_event_fn(struct work_struct *work)
> +{
> +       struct psi_trigger *t;
> +       struct bpf_psi *bpf_psi;
> +       int idx;
> +
> +       t = container_of(work, struct psi_trigger, bpf_work);
> +       bpf_psi = READ_ONCE(t->bpf_psi);
> +
> +       if (likely(bpf_psi)) {
> +               idx = srcu_read_lock(&bpf_psi->srcu);
> +               if (bpf_psi->ops->handle_psi_event)
> +                       bpf_psi->ops->handle_psi_event(t);
> +               srcu_read_unlock(&bpf_psi->srcu, idx);
> +       }
> +}
> +
> +void bpf_psi_add_trigger(struct psi_trigger *t,
> +                        const struct psi_trigger_params *params)
> +{
> +       t->bpf_psi = params->bpf_psi;
> +       t->pinned = false;
> +       INIT_WORK(&t->bpf_work, bpf_psi_handle_event_fn);
> +
> +       spin_lock(&t->bpf_psi->lock);
> +       list_add(&t->bpf_psi_node, &t->bpf_psi->triggers);
> +       spin_unlock(&t->bpf_psi->lock);
> +
> +       spin_lock(&t->group->bpf_triggers_lock);
> +       list_add(&t->bpf_group_node, &t->group->bpf_triggers);
> +       spin_unlock(&t->group->bpf_triggers_lock);
> +}
> +
> +void bpf_psi_remove_trigger(struct psi_trigger *t)
> +{
> +       spin_lock(&t->group->bpf_triggers_lock);
> +       list_del(&t->bpf_group_node);
> +       spin_unlock(&t->group->bpf_triggers_lock);
> +
> +       spin_lock(&t->bpf_psi->lock);
> +       list_del(&t->bpf_psi_node);
> +       spin_unlock(&t->bpf_psi->lock);
> +}
> +
> +#ifdef CONFIG_CGROUPS
> +void bpf_psi_cgroup_free(struct cgroup *cgroup)
> +{
> +       struct psi_group *group = cgroup->psi;
> +       u64 cgrp_id = cgroup_id(cgroup);
> +       struct psi_trigger *t, *p;
> +       struct bpf_psi *bpf_psi;
> +       LIST_HEAD(to_destroy);
> +       int idx;
> +
> +       spin_lock(&group->bpf_triggers_lock);
> +       list_for_each_entry_safe(t, p, &group->bpf_triggers, bpf_group_node) {
> +               if (!t->pinned) {
> +                       t->pinned = true;
> +                       list_move(&t->bpf_group_node, &to_destroy);
> +               }
> +       }
> +       spin_unlock(&group->bpf_triggers_lock);
> +
> +       list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node) {
> +               bpf_psi = READ_ONCE(t->bpf_psi);
> +
> +               idx = srcu_read_lock(&bpf_psi->srcu);
> +               if (bpf_psi->ops->handle_cgroup_free)
> +                       bpf_psi->ops->handle_cgroup_free(cgrp_id);
> +               srcu_read_unlock(&bpf_psi->srcu, idx);
> +
> +               spin_lock(&bpf_psi->lock);
> +               list_del(&t->bpf_psi_node);
> +               spin_unlock(&bpf_psi->lock);
> +
> +               WRITE_ONCE(t->bpf_psi, NULL);
> +               flush_workqueue(bpf_psi_wq);
> +               synchronize_srcu(&bpf_psi->srcu);
> +               psi_trigger_destroy(t);
> +       }
> +}
> +#endif
> +
> +void bpf_psi_handle_event(struct psi_trigger *t)
> +{
> +       queue_work(bpf_psi_wq, &t->bpf_work);
> +}
> +
> +// bpf struct ops

C++ style comment?

> +
> +static int __bpf_psi_init(struct bpf_psi *bpf_psi) { return 0; }
> +static void __bpf_psi_handle_psi_event(struct psi_trigger *t) {}
> +static void __bpf_psi_handle_cgroup_free(u64 cgroup_id) {}
> +
> +static struct bpf_psi_ops __bpf_psi_ops = {
> +       .init = __bpf_psi_init,
> +       .handle_psi_event = __bpf_psi_handle_psi_event,
> +       .handle_cgroup_free = __bpf_psi_handle_cgroup_free,
> +};
> +
> +static const struct bpf_func_proto *
> +bpf_psi_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +       return tracing_prog_func_proto(func_id, prog);
> +}
> +
> +static bool bpf_psi_ops_is_valid_access(int off, int size,
> +                                       enum bpf_access_type type,
> +                                       const struct bpf_prog *prog,
> +                                       struct bpf_insn_access_aux *info)
> +{
> +       return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
> +       .get_func_proto = bpf_psi_func_proto,
> +       .is_valid_access = bpf_psi_ops_is_valid_access,
> +};
> +
> +static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
> +{
> +       struct bpf_psi_ops *ops = kdata;
> +       struct bpf_psi *bpf_psi;
> +
> +       bpf_psi = bpf_psi_create(ops);
> +       if (!bpf_psi)
> +               return -ENOMEM;
> +
> +       return ops->init(bpf_psi);
> +}
> +
> +static void bpf_psi_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +       struct bpf_psi_ops *ops = kdata;
> +       struct bpf_psi *bpf_psi = ops->bpf_psi;
> +       struct psi_trigger *t, *p;
> +       LIST_HEAD(to_destroy);
> +
> +       spin_lock(&bpf_psi->lock);
> +       list_for_each_entry_safe(t, p, &bpf_psi->triggers, bpf_psi_node) {
> +               spin_lock(&t->group->bpf_triggers_lock);
> +               if (!t->pinned) {
> +                       t->pinned = true;
> +                       list_move(&t->bpf_group_node, &to_destroy);
> +                       list_del(&t->bpf_psi_node);
> +
> +                       WRITE_ONCE(t->bpf_psi, NULL);
> +               }
> +               spin_unlock(&t->group->bpf_triggers_lock);
> +       }
> +       spin_unlock(&bpf_psi->lock);
> +
> +       flush_workqueue(bpf_psi_wq);
> +       synchronize_srcu(&bpf_psi->srcu);
> +
> +       list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node)
> +               psi_trigger_destroy(t);
> +
> +       bpf_psi_free(bpf_psi);
> +}
> +
> +static int bpf_psi_ops_check_member(const struct btf_type *t,
> +                                   const struct btf_member *member,
> +                                   const struct bpf_prog *prog)
> +{
> +       return 0;
> +}
> +
> +static int bpf_psi_ops_init_member(const struct btf_type *t,
> +                                  const struct btf_member *member,
> +                                  void *kdata, const void *udata)
> +{
> +       return 0;
> +}
> +
> +static int bpf_psi_ops_init(struct btf *btf)
> +{
> +       return 0;
> +}
> +
> +static struct bpf_struct_ops bpf_psi_bpf_ops = {
> +       .verifier_ops = &bpf_psi_verifier_ops,
> +       .reg = bpf_psi_ops_reg,
> +       .unreg = bpf_psi_ops_unreg,
> +       .check_member = bpf_psi_ops_check_member,
> +       .init_member = bpf_psi_ops_init_member,
> +       .init = bpf_psi_ops_init,
> +       .name = "bpf_psi_ops",
> +       .owner = THIS_MODULE,
> +       .cfi_stubs = &__bpf_psi_ops
> +};
> +
> +static int __init bpf_psi_struct_ops_init(void)
> +{
> +       int wq_flags = WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_HIGHPRI;
> +       int err;
> +
> +       bpf_psi_wq = alloc_workqueue("bpf_psi_wq", wq_flags, 0);
> +       if (!bpf_psi_wq)
> +               return -ENOMEM;
> +
> +       err = register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops);
> +       if (err) {
> +               pr_warn("error while registering bpf psi struct ops: %d", err);
> +               goto err;
> +       }
> +
> +       return 0;
> +
> +err:
> +       destroy_workqueue(bpf_psi_wq);
> +       return err;
> +}
> +late_initcall(bpf_psi_struct_ops_init);
> diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c
> index bf9d8db94b70..80f3799a2fa6 100644
> --- a/kernel/sched/build_utility.c
> +++ b/kernel/sched/build_utility.c
> @@ -19,6 +19,7 @@
>  #include <linux/sched/rseq_api.h>
>  #include <linux/sched/task_stack.h>
>
> +#include <linux/bpf_psi.h>
>  #include <linux/cpufreq.h>
>  #include <linux/cpumask_api.h>
>  #include <linux/cpuset.h>
> @@ -92,6 +93,9 @@
>
>  #ifdef CONFIG_PSI
>  # include "psi.c"
> +# ifdef CONFIG_BPF_SYSCALL
> +#  include "bpf_psi.c"
> +# endif
>  #endif
>
>  #ifdef CONFIG_MEMBARRIER
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index e1d8eaeeff17..e10fbbc34099 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -201,6 +201,10 @@ static void group_init(struct psi_group *group)
>         init_waitqueue_head(&group->rtpoll_wait);
>         timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
>         rcu_assign_pointer(group->rtpoll_task, NULL);
> +#ifdef CONFIG_BPF_SYSCALL
> +       spin_lock_init(&group->bpf_triggers_lock);
> +       INIT_LIST_HEAD(&group->bpf_triggers);
> +#endif
>  }
>
>  void __init psi_init(void)
> @@ -489,10 +493,17 @@ static void update_triggers(struct psi_group *group, u64 now,
>
>                 /* Generate an event */
>                 if (cmpxchg(&t->event, 0, 1) == 0) {
> -                       if (t->type == PSI_CGROUP)
> -                               kernfs_notify(t->of->kn);
> -                       else
> +                       switch (t->type) {
> +                       case PSI_SYSTEM:
>                                 wake_up_interruptible(&t->event_wait);
> +                               break;
> +                       case PSI_CGROUP:
> +                               kernfs_notify(t->of->kn);
> +                               break;
> +                       case PSI_BPF:
> +                               bpf_psi_handle_event(t);
> +                               break;
> +                       }
>                 }
>                 t->last_event_time = now;
>                 /* Reset threshold breach flag once event got generated */
> @@ -1125,6 +1136,7 @@ void psi_cgroup_free(struct cgroup *cgroup)
>                 return;
>
>         cancel_delayed_work_sync(&cgroup->psi->avgs_work);
> +       bpf_psi_cgroup_free(cgroup);
>         free_percpu(cgroup->psi->pcpu);
>         /* All triggers must be removed by now */
>         WARN_ONCE(cgroup->psi->rtpoll_states, "psi: trigger leak\n");
> @@ -1356,6 +1368,9 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
>         case PSI_CGROUP:
>                 t->of = params->of;
>                 break;
> +       case PSI_BPF:
> +               bpf_psi_add_trigger(t, params);
> +               break;
>         }
>
>         t->pending_event = false;
> @@ -1369,8 +1384,10 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
>
>                         task = kthread_create(psi_rtpoll_worker, group, "psimon");
>                         if (IS_ERR(task)) {
> -                               kfree(t);
>                                 mutex_unlock(&group->rtpoll_trigger_lock);
> +                               if (t->type == PSI_BPF)
> +                                       bpf_psi_remove_trigger(t);
> +                               kfree(t);
>                                 return ERR_CAST(task);
>                         }
>                         atomic_set(&group->rtpoll_wakeup, 0);
> @@ -1414,10 +1431,16 @@ void psi_trigger_destroy(struct psi_trigger *t)

Will this function ever be called for PSI_BPF triggers? Same question
for psi_trigger_poll().




>          * being accessed later. Can happen if cgroup is deleted from under a
>          * polling process.
>          */
> -       if (t->type == PSI_CGROUP)
> -               kernfs_notify(t->of->kn);
> -       else
> +       switch (t->type) {
> +       case PSI_SYSTEM:
>                 wake_up_interruptible(&t->event_wait);
> +               break;
> +       case PSI_CGROUP:
> +               kernfs_notify(t->of->kn);
> +               break;
> +       case PSI_BPF:
> +               break;
> +       }
>
>         if (t->aggregator == PSI_AVGS) {
>                 mutex_lock(&group->avgs_lock);
> @@ -1494,10 +1517,16 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
>         if (!t)
>                 return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
>
> -       if (t->type == PSI_CGROUP)
> -               kernfs_generic_poll(t->of, wait);
> -       else
> +       switch (t->type) {
> +       case PSI_SYSTEM:
>                 poll_wait(file, &t->event_wait, wait);
> +               break;
> +       case PSI_CGROUP:
> +               kernfs_generic_poll(t->of, wait);
> +               break;
> +       case PSI_BPF:
> +               break;
> +       }
>
>         if (cmpxchg(&t->event, 1, 0) == 1)
>                 ret |= EPOLLPRI;
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 00/14] mm: BPF OOM
  2025-08-19  4:08 ` [PATCH v1 00/14] mm: BPF OOM Suren Baghdasaryan
@ 2025-08-19 19:52   ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-19 19:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Suren Baghdasaryan <surenb@google.com> writes:

> On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> This patchset adds an ability to customize the out of memory
>> handling using bpf.
>>
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
>>
>> The idea to use bpf for customizing the OOM handling is not new, but
>> unlike the previous proposal [1], which augmented the existing task
>> ranking policy, this one tries to be as generic as possible and
>> leverage the full power of the modern bpf.
>>
>> It provides a generic interface which is called before the existing OOM
>> killer code and allows implementing any policy, e.g. picking a victim
>> task or memory cgroup or potentially even releasing memory in other
>> ways, e.g. deleting tmpfs files (the last one might require some
>> additional but relatively simple changes).
>>
>> The past attempt to implement memory-cgroup aware policy [2] showed
>> that there are multiple opinions on what the best policy is.  As it's
>> highly workload-dependent and specific to a concrete way of organizing
>> workloads, the structure of the cgroup tree etc, a customizable
>> bpf-based implementation is preferable over a in-kernel implementation
>> with a dozen on sysctls.
>
> s/on/of ?

Fixed, thanks.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-19  4:09   ` Suren Baghdasaryan
@ 2025-08-19 20:06     ` Roman Gushchin
  2025-08-20 19:34       ` Suren Baghdasaryan
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-19 20:06 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Suren Baghdasaryan <surenb@google.com> writes:

> On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> Introduce a bpf struct ops for implementing custom OOM handling policies.
>>
>> The struct ops provides the bpf_handle_out_of_memory() callback,
>> which expected to return 1 if it was able to free some memory and 0
>> otherwise.
>>
>> In the latter case it's guaranteed that the in-kernel OOM killer will
>> be invoked. Otherwise the kernel also checks the bpf_memory_freed
>> field of the oom_control structure, which is expected to be set by
>> kfuncs suitable for releasing memory. It's a safety mechanism which
>> prevents a bpf program to claim forward progress without actually
>> releasing memory. The callback program is sleepable to enable using
>> iterators, e.g. cgroup iterators.
>>
>> The callback receives struct oom_control as an argument, so it can
>> easily filter out OOM's it doesn't want to handle, e.g. global vs
>> memcg OOM's.
>>
>> The callback is executed just before the kernel victim task selection
>> algorithm, so all heuristics and sysctls like panic on oom,
>> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
>> are respected.
>>
>> The struct ops also has the name field, which allows to define a
>> custom name for the implemented policy. It's printed in the OOM report
>> in the oom_policy=<policy> format. "default" is printed if bpf is not
>> used or policy name is not specified.
>>
>> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>>                oom_policy=bpf_test_policy
>> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
>> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
>> [  112.698167] Call Trace:
>> [  112.698177]  <TASK>
>> [  112.698182]  dump_stack_lvl+0x4d/0x70
>> [  112.698192]  dump_header+0x59/0x1c6
>> [  112.698199]  oom_kill_process.cold+0x8/0xef
>> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
>> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
>> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
>> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
>> [  112.698250]  out_of_memory+0xab/0x5c0
>> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
>> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
>> [  112.698288]  charge_memcg+0x2f/0xc0
>> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
>> [  112.698299]  do_anonymous_page+0x40f/0xa50
>> [  112.698311]  __handle_mm_fault+0xbba/0x1140
>> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [  112.698335]  handle_mm_fault+0xe6/0x370
>> [  112.698343]  do_user_addr_fault+0x211/0x6a0
>> [  112.698354]  exc_page_fault+0x75/0x1d0
>> [  112.698363]  asm_exc_page_fault+0x26/0x30
>> [  112.698366] RIP: 0033:0x7fa97236db00
>>
>> It's possible to load multiple bpf struct programs. In the case of
>> oom, they will be executed one by one in the same order they been
>> loaded until one of them returns 1 and bpf_memory_freed is set to 1
>> - an indication that the memory was freed. This allows to have
>> multiple bpf programs to focus on different types of OOM's - e.g.
>> one program can only handle memcg OOM's in one memory cgroup.
>> But the filtering is done in bpf - so it's fully flexible.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  include/linux/bpf_oom.h |  49 +++++++++++++
>>  include/linux/oom.h     |   8 ++
>>  mm/Makefile             |   3 +
>>  mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
>>  mm/oom_kill.c           |  22 +++++-
>>  5 files changed, 237 insertions(+), 2 deletions(-)
>>  create mode 100644 include/linux/bpf_oom.h
>>  create mode 100644 mm/bpf_oom.c
>>
>> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
>> new file mode 100644
>> index 000000000000..29cb5ea41d97
>> --- /dev/null
>> +++ b/include/linux/bpf_oom.h
>> @@ -0,0 +1,49 @@
>> +/* SPDX-License-Identifier: GPL-2.0+ */
>> +
>> +#ifndef __BPF_OOM_H
>> +#define __BPF_OOM_H
>> +
>> +struct bpf_oom;
>> +struct oom_control;
>> +
>> +#define BPF_OOM_NAME_MAX_LEN 64
>> +
>> +struct bpf_oom_ops {
>> +       /**
>> +        * @handle_out_of_memory: Out of memory bpf handler, called before
>> +        * the in-kernel OOM killer.
>> +        * @oc: OOM control structure
>> +        *
>> +        * Should return 1 if some memory was freed up, otherwise
>> +        * the in-kernel OOM killer is invoked.
>> +        */
>> +       int (*handle_out_of_memory)(struct oom_control *oc);
>> +
>> +       /**
>> +        * @name: BPF OOM policy name
>> +        */
>> +       char name[BPF_OOM_NAME_MAX_LEN];
>
> Why should the name be a part of the ops structure? IMO it's not an
> attribute of the operations but rather of the oom handler, which is
> represented by bpf_oom here.

The ops structure describes a user-defined oom policy. Currently
it's just one handler and the policy name. Later additional handlers
can be added, e.g. a handler to control the dmesg output.

bpf_oom is an implementation detail: it's basically an extension
to struct bpf_oom_ops which contains "private" fields required
for the internal machinery.
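
For illustration, the bpf side of a policy could look roughly like this
(a rough sketch, not the actual selftest; the map name, the link-based
struct_ops attachment and the section names are assumptions, and it
relies on a vmlinux.h generated from a kernel with this series applied):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Sleepable struct ops callback, see handle_out_of_memory above */
SEC("struct_ops.s/handle_out_of_memory")
int BPF_PROG(test_out_of_memory, struct oom_control *oc)
{
	return 0; /* defer to the in-kernel OOM killer */
}

SEC(".struct_ops.link")
struct bpf_oom_ops test_bpf_oom = {
	.handle_out_of_memory = (void *)test_out_of_memory,
	.name = "bpf_test_policy",
};

So the policy name lives right next to the callbacks it describes,
while ->bpf_oom stays kernel-internal and is never set from bpf.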

>
>> +
>> +       /* Private */
>> +       struct bpf_oom *bpf_oom;
>> +};
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +/**
>> + * @bpf_handle_oom: handle out of memory using bpf programs
>> + * @oc: OOM control structure
>> + *
>> + * Returns true if a bpf oom program was executed, returned 1
>> + * and some memory was actually freed.
>
> The above comment is unclear, please clarify.

Fixed, thanks.

/**
 * @bpf_handle_oom: handle out of memory condition using bpf
 * @oc: OOM control structure
 *
 * Returns true if some memory was freed.
 */
bool bpf_handle_oom(struct oom_control *oc);


>
>> + */
>> +bool bpf_handle_oom(struct oom_control *oc);
>> +
>> +#else /* CONFIG_BPF_SYSCALL */
>> +static inline bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> +       return false;
>> +}
>> +
>> +#endif /* CONFIG_BPF_SYSCALL */
>> +
>> +#endif /* __BPF_OOM_H */
>> diff --git a/include/linux/oom.h b/include/linux/oom.h
>> index 1e0fc6931ce9..ef453309b7ea 100644
>> --- a/include/linux/oom.h
>> +++ b/include/linux/oom.h
>> @@ -51,6 +51,14 @@ struct oom_control {
>>
>>         /* Used to print the constraint info. */
>>         enum oom_constraint constraint;
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +       /* Used by the bpf oom implementation to mark the forward progress */
>> +       bool bpf_memory_freed;
>> +
>> +       /* Policy name */
>> +       const char *bpf_policy_name;
>> +#endif
>>  };
>>
>>  extern struct mutex oom_lock;
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 1a7a11d4933d..a714aba03759 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>>  ifdef CONFIG_SWAP
>>  obj-$(CONFIG_MEMCG) += swap_cgroup.o
>>  endif
>> +ifdef CONFIG_BPF_SYSCALL
>> +obj-y += bpf_oom.o
>> +endif
>>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>>  obj-$(CONFIG_GUP_TEST) += gup_test.o
>>  obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
>> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
>> new file mode 100644
>> index 000000000000..47633046819c
>> --- /dev/null
>> +++ b/mm/bpf_oom.c
>> @@ -0,0 +1,157 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * BPF-driven OOM killer customization
>> + *
>> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
>> + */
>> +
>> +#include <linux/bpf.h>
>> +#include <linux/oom.h>
>> +#include <linux/bpf_oom.h>
>> +#include <linux/srcu.h>
>> +
>> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
>> +static DEFINE_SPINLOCK(bpf_oom_lock);
>> +static LIST_HEAD(bpf_oom_handlers);
>> +
>> +struct bpf_oom {
>
> Perhaps bpf_oom_handler ? Then bpf_oom_ops->bpf_oom could be called
> bpf_oom_ops->handler.

I don't think it's a handler; it's more like a private part
of bpf_oom_ops. Maybe bpf_oom_impl? I'm not sure.

>
>
>> +       struct bpf_oom_ops *ops;
>> +       struct list_head node;
>> +       struct srcu_struct srcu;
>> +};
>> +
>> +bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> +       struct bpf_oom_ops *ops;
>> +       struct bpf_oom *bpf_oom;
>> +       int list_idx, idx, ret = 0;
>> +
>> +       oc->bpf_memory_freed = false;
>> +
>> +       list_idx = srcu_read_lock(&bpf_oom_srcu);
>> +       list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
>> +               ops = READ_ONCE(bpf_oom->ops);
>> +               if (!ops || !ops->handle_out_of_memory)
>> +                       continue;
>> +               idx = srcu_read_lock(&bpf_oom->srcu);
>> +               oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
>> +                       "bpf_defined_policy";
>> +               ret = ops->handle_out_of_memory(oc);
>> +               oc->bpf_policy_name = NULL;
>> +               srcu_read_unlock(&bpf_oom->srcu, idx);
>> +
>> +               if (ret && oc->bpf_memory_freed)
>
> IIUC ret and oc->bpf_memory_freed seem to reflect the same state:
> handler successfully freed some memory. Could you please clarify when
> they differ?

The idea here is to provide an additional safety measure:
if the bpf program simply returns 1 without doing anything,
the system won't deadlock.

oc->bpf_memory_freed is set by the bpf_oom_kill_process() helper
(and potentially some other helpers in the future, e.g.
bpf_oom_rm_tmpfs_file()) and can't be modified by the bpf
program directly.
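
To illustrate (a minimal sketch following the selftest skeleton, names
are made up): a handler like this can't trick the kernel into skipping
the OOM kill, because no kfunc has set oc->bpf_memory_freed:

// SPDX-License-Identifier: GPL-2.0
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("struct_ops.s/handle_out_of_memory")
int BPF_PROG(noop_out_of_memory, struct oom_control *oc)
{
	/* Claims forward progress without freeing anything */
	return 1;
}

SEC(".struct_ops.link")
struct bpf_oom_ops noop_oom_ops = {
	.name = "noop_policy",
	.handle_out_of_memory = (void *)noop_out_of_memory,
};

so the kernel falls back to the regular OOM killer in this case.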


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 06/14] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-08-19  4:09   ` Suren Baghdasaryan
@ 2025-08-19 20:16     ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-19 20:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Suren Baghdasaryan <surenb@google.com> writes:

> On Mon, Aug 18, 2025 at 10:02 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> Introduce bpf_out_of_memory() bpf kfunc, which allows declaring an
>> out of memory event and triggering the corresponding kernel OOM
>> handling mechanism.
>>
>> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
>> as an argument, as well as the page order.
>>
>> If the wait_on_oom_lock argument is not set, only one OOM can be
>> declared and handled in the system at once, so if the function is
>> called in parallel to another OOM handling, it bails out with -EBUSY.
>> This mode is suited for global OOM's: any concurrent OOMs will likely
>> do the job and release some memory. In a blocking mode (which is
>> suited for memcg OOMs) the execution will wait on the oom_lock mutex.
>>
>> The function is declared as sleepable. It guarantees that it won't
>> be called from an atomic context. It's required by the OOM handling
>> code, which is not guaranteed to work in a non-blocking context.
>>
>> Handling of a memcg OOM almost always requires taking of the
>> css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable
>> also guarantees that it can't be called with acquired css_set_lock,
>> so the kernel can't deadlock on it.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  mm/oom_kill.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 25fc5e744e27..df409f0fac45 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1324,10 +1324,55 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>>         return 0;
>>  }
>>
>> +/**
>> + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
>> + * @memcg__nullable: memcg or NULL for system-wide OOMs
>> + * @order: order of page which wasn't allocated
>> + * @wait_on_oom_lock: if true, block on oom_lock
>> + * @constraint_text__nullable: custom constraint description for the OOM report
>> + *
>> + * Declares the Out Of Memory state and invokes the OOM killer.
>> + *
>> + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
>> + * is true, the function will wait on it. Otherwise it bails out with -EBUSY
>> + * if oom_lock is contended.
>> + *
>> + * Generally it's advised to pass wait_on_oom_lock=true for global OOMs
>> + * and wait_on_oom_lock=false for memcg-scoped OOMs.
>
> From the changelog description I was under impression that it's vice
> versa, for global OOMs you would not block (wait_on_oom_lock=false),
> for memcg ones you would (wait_on_oom_lock=true).

Good catch, fixed.

>
>> + *
>> + * Returns 1 if the forward progress was achieved and some memory was freed.
>> + * Returns a negative value if an error has been occurred.
>
> s/has been occurred/has occurred or occured

Same here.
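
For reference, the corrected wording is roughly:

 * Generally it's advised to pass wait_on_oom_lock=false for global OOMs
 * and wait_on_oom_lock=true for memcg-scoped OOMs.
 *
 * Returns 1 if the forward progress was achieved and some memory was freed.
 * Returns a negative value if an error has occurred.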

Thanks!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 11/14] sched: psi: refactor psi_trigger_create()
  2025-08-19  4:09   ` Suren Baghdasaryan
@ 2025-08-19 20:28     ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-19 20:28 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Suren Baghdasaryan <surenb@google.com> writes:

> On Mon, Aug 18, 2025 at 10:02 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> Currently psi_trigger_create() does a lot of things:
>> parses the user text input, allocates and initializes
>> the psi_trigger structure and turns on the trigger.
>> It does it slightly different for two existing types
>> of psi_triggers: system-wide and cgroup-wide.
>>
>> In order to support a new type of psi triggers, which
>> will be owned by a bpf program and won't have a user's
>> text description, let's refactor psi_trigger_create().
>>
>> 1. Introduce psi_trigger_type enum:
>>    currently PSI_SYSTEM and PSI_CGROUP are valid values.
>> 2. Introduce psi_trigger_params structure to avoid passing
>>    a large number of parameters to psi_trigger_create().
>> 3. Move out the user's input parsing into the new
>>    psi_trigger_parse() helper.
>> 4. Move out the capabilities check into the new
>>    psi_file_privileged() helper.
>> 5. Stop relying on t->of for detecting trigger type.
>
> It's worth noting that this is a pure core refactoring without any
> functional change (hopefully :))

Added this to the commit log.

>
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  include/linux/psi.h       | 15 +++++--
>>  include/linux/psi_types.h | 33 ++++++++++++++-
>>  kernel/cgroup/cgroup.c    | 14 ++++++-
>>  kernel/sched/psi.c        | 87 +++++++++++++++++++++++++--------------
>>  4 files changed, 112 insertions(+), 37 deletions(-)
>>
>> diff --git a/include/linux/psi.h b/include/linux/psi.h
>> index e0745873e3f2..8178e998d94b 100644
>> --- a/include/linux/psi.h
>> +++ b/include/linux/psi.h
>> @@ -23,14 +23,23 @@ void psi_memstall_enter(unsigned long *flags);
>>  void psi_memstall_leave(unsigned long *flags);
>>
>>  int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
>> -struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
>> -                                      enum psi_res res, struct file *file,
>> -                                      struct kernfs_open_file *of);
>> +int psi_trigger_parse(struct psi_trigger_params *params, const char *buf);
>> +struct psi_trigger *psi_trigger_create(struct psi_group *group,
>> +                               const struct psi_trigger_params *param);
>>  void psi_trigger_destroy(struct psi_trigger *t);
>>
>>  __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
>>                         poll_table *wait);
>>
>> +static inline bool psi_file_privileged(struct file *file)
>> +{
>> +       /*
>> +        * Checking the privilege here on file->f_cred implies that a privileged user
>> +        * could open the file and delegate the write to an unprivileged one.
>> +        */
>> +       return cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
>> +}
>> +
>>  #ifdef CONFIG_CGROUPS
>>  static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
>>  {
>> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
>> index f1fd3a8044e0..cea54121d9b9 100644
>> --- a/include/linux/psi_types.h
>> +++ b/include/linux/psi_types.h
>> @@ -121,7 +121,38 @@ struct psi_window {
>>         u64 prev_growth;
>>  };
>>
>> +enum psi_trigger_type {
>> +       PSI_SYSTEM,
>> +       PSI_CGROUP,
>> +};
>> +
>> +struct psi_trigger_params {
>> +       /* Trigger type */
>> +       enum psi_trigger_type type;
>> +
>> +       /* Resources that workloads could be stalled on */
>
> I would describe this as "Resource to be monitored"

Fixed.

>
>> +       enum psi_res res;
>> +
>> +       /* True if all threads should be stalled to trigger */
>> +       bool full;
>> +
>> +       /* Threshold in us */
>> +       u32 threshold_us;
>> +
>> +       /* Window in us */
>> +       u32 window_us;
>> +
>> +       /* Privileged triggers are treated differently */
>> +       bool privileged;
>> +
>> +       /* Link to kernfs open file, only for PSI_CGROUP */
>> +       struct kernfs_open_file *of;
...
>>         t->event = 0;
>>         t->last_event_time = 0;
>> -       t->of = of;
>> -       if (!of)
>> +
>> +       switch (params->type) {
>> +       case PSI_SYSTEM:
>>                 init_waitqueue_head(&t->event_wait);
>
> I think t->of will be left uninitialized here. Let's set it to NULL
> please.

Ack.

Thanks!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf
  2025-08-19  4:11   ` Suren Baghdasaryan
@ 2025-08-19 22:31     ` Roman Gushchin
  2025-08-19 23:31       ` Roman Gushchin
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-19 22:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Suren Baghdasaryan <surenb@google.com> writes:

> On Mon, Aug 18, 2025 at 10:02 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> This patch implements a bpf struct ops-based mechanism to create
>> psi triggers, attach them to cgroups or system wide and handle
>> psi events in bpf.
>>
>> The struct ops provides 3 callbacks:
>>   - init() called once at load, handy for creating psi triggers
>>   - handle_psi_event() called every time a psi trigger fires
>>   - handle_cgroup_free() called if a cgroup with an attached
>>     trigger is being freed
>>
>> A single struct ops can create a number of psi triggers, both
>> cgroup-scoped and system-wide.
>>
>> All 3 struct ops callbacks can be sleepable. handle_psi_event()
>> handlers are executed using a separate workqueue, so it won't
>> affect the latency of other psi triggers.
>
> I'll need to stare some more into this code but overall it makes sense
> to me. Some early comments below.

Ack, thanks!


>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  include/linux/bpf_psi.h      |  71 ++++++++++
>>  include/linux/psi_types.h    |  43 +++++-
>>  kernel/sched/bpf_psi.c       | 253 +++++++++++++++++++++++++++++++++++
>>  kernel/sched/build_utility.c |   4 +
>>  kernel/sched/psi.c           |  49 +++++--
>>  5 files changed, 408 insertions(+), 12 deletions(-)
>>  create mode 100644 include/linux/bpf_psi.h
>>  create mode 100644 kernel/sched/bpf_psi.c
>>
>> diff --git a/include/linux/bpf_psi.h b/include/linux/bpf_psi.h
>> new file mode 100644
>> index 000000000000..826ab89ac11c
>> --- /dev/null
>> +++ b/include/linux/bpf_psi.h
>> @@ -0,0 +1,71 @@
>> +/* SPDX-License-Identifier: GPL-2.0+ */
>> +
>> +#ifndef __BPF_PSI_H
>> +#define __BPF_PSI_H
>> +
>> +#include <linux/list.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/srcu.h>
>> +#include <linux/psi_types.h>
>> +
>> +struct cgroup;
>> +struct bpf_psi;
>> +struct psi_trigger;
>> +struct psi_trigger_params;
>> +
>> +#define BPF_PSI_FULL 0x80000000
>> +
>> +struct bpf_psi_ops {
>> +       /**
>> +        * @init: Initialization callback, suited for creating psi triggers.
>> +        * @bpf_psi: bpf_psi pointer, can be passed to bpf_psi_create_trigger().
>> +        *
>> +        * A non-0 return value means the initialization has been failed.
>> +        */
>> +       int (*init)(struct bpf_psi *bpf_psi);
>> +
>> +       /**
>> +        * @handle_psi_event: PSI event callback
>> +        * @t: psi_trigger pointer
>> +        */
>> +       void (*handle_psi_event)(struct psi_trigger *t);
>> +
>> +       /**
>> +        * @handle_cgroup_free: Cgroup free callback
>> +        * @cgroup_id: Id of freed cgroup
>> +        *
>> +        * Called every time a cgroup with an attached bpf psi trigger is freed.
>> +        * No psi events can be raised after handle_cgroup_free().
>> +        */
>> +       void (*handle_cgroup_free)(u64 cgroup_id);
>> +
>> +       /* private */
>> +       struct bpf_psi *bpf_psi;
>> +};
>> +
>> +struct bpf_psi {
>> +       spinlock_t lock;
>> +       struct list_head triggers;
>> +       struct bpf_psi_ops *ops;
>> +       struct srcu_struct srcu;
>> +};
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +void bpf_psi_add_trigger(struct psi_trigger *t,
>> +                        const struct psi_trigger_params *params);
>> +void bpf_psi_remove_trigger(struct psi_trigger *t);
>> +void bpf_psi_handle_event(struct psi_trigger *t);
>> +#ifdef CONFIG_CGROUPS
>> +void bpf_psi_cgroup_free(struct cgroup *cgroup);
>> +#endif
>> +
>> +#else /* CONFIG_BPF_SYSCALL */
>> +static inline void bpf_psi_add_trigger(struct psi_trigger *t,
>> +                       const struct psi_trigger_params *params) {}
>> +static inline void bpf_psi_remove_trigger(struct psi_trigger *t) {}
>> +static inline void bpf_psi_handle_event(struct psi_trigger *t) {}
>> +static inline void bpf_psi_cgroup_free(struct cgroup *cgroup) {}
>> +
>> +#endif /* CONFIG_BPF_SYSCALL */
>> +
>> +#endif /* __BPF_PSI_H */
>> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
>> index cea54121d9b9..f695cc34cfd4 100644
>> --- a/include/linux/psi_types.h
>> +++ b/include/linux/psi_types.h
>> @@ -124,6 +124,7 @@ struct psi_window {
>>  enum psi_trigger_type {
>>         PSI_SYSTEM,
>>         PSI_CGROUP,
>> +       PSI_BPF,
>>  };
>>
>>  struct psi_trigger_params {
>> @@ -145,8 +146,15 @@ struct psi_trigger_params {
>>         /* Privileged triggers are treated differently */
>>         bool privileged;
>>
>> -       /* Link to kernfs open file, only for PSI_CGROUP */
>> -       struct kernfs_open_file *of;
>> +       union {
>> +               /* Link to kernfs open file, only for PSI_CGROUP */
>> +               struct kernfs_open_file *of;
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +               /* Link to bpf_psi structure, only for BPF_PSI */
>> +               struct bpf_psi *bpf_psi;
>> +#endif
>> +       };
>>  };
>>
>>  struct psi_trigger {
>> @@ -188,6 +196,31 @@ struct psi_trigger {
>>
>>         /* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
>>         enum psi_aggregators aggregator;
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +       /* Fields specific to PSI_BPF triggers */
>> +
>> +       /* Bpf psi structure for events handling */
>> +       struct bpf_psi *bpf_psi;
>> +
>> +       /* List node inside bpf_psi->triggers list */
>> +       struct list_head bpf_psi_node;
>> +
>> +       /* List node inside group->bpf_triggers list */
>> +       struct list_head bpf_group_node;
>> +
>> +       /* Work structure, used to execute event handlers */
>> +       struct work_struct bpf_work;
>
> I think bpf_work can be moved into struct bpf_psi as you are using it
> get to bpf_psi anyway:
>
>        t = container_of(work, struct psi_trigger, bpf_work);
>        bpf_psi = READ_ONCE(t->bpf_psi);

Not really.
The problem is that for every bpf_psi structure there can be multiple
triggers. E.g. a trigger for each cgroup. We should be able to handle
them independently.

>> +
>> +       /*
>> +        * Whether the trigger is being pinned in memory.
>> +        * Protected by group->bpf_triggers_lock.
>> +        */
>> +       bool pinned;
>
> Same with pinned field. I think you are using it only with triggers
> which have a valid t->bpf_work, so might as well move in there. I
> would also call this field "isolated" rather than "pinned" but that's
> just a preference.

Here the problem is that a bpf trigger can be destroyed from two sides:
when the struct_ops is unloaded or when the corresponding cgroup is being
freed. Each trigger sits on two lists: a list of triggers attached
to a specific group and a list of triggers owned by a bpf_psi.
This makes the locking a bit complicated and this is why I need
this pinned flag. It's pinning triggers, not bpf_psi.

>
>> +
>> +       /* Cgroup Id */
>> +       u64 cgroup_id;
>
> This cgroup_id field is weird. It's not initialized and not used here,
> then it gets initialized in the next patch and used in the last patch
> from a selftest. This is quite confusing. Also logically I don't think
> a cgroup attribute really belongs to psi_trigger... Can we at least
> move it into bpf_psi where it might fit a bit better?

I can't move it to bpf_psi, because a single bpf_psi might own multiple
triggers with different cgroup_id's.
For sure I can move it to the next patch, if it's preferred.

If you really don't like it here, other option is to replace it with
a new bpf helper (kfunc) which calculates the cgroup_id by walking the
trigger->group->cgroup->cgroup_id path each time.

>
>> +#endif
>>  };
>>
>>  struct psi_group {
>> @@ -236,6 +269,12 @@ struct psi_group {
>>         u64 rtpoll_total[NR_PSI_STATES - 1];
>>         u64 rtpoll_next_update;
>>         u64 rtpoll_until;
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +       /* List of triggers owned by bpf and corresponding lock */
>> +       spinlock_t bpf_triggers_lock;
>> +       struct list_head bpf_triggers;
>> +#endif
>>  };
>>
>>  #else /* CONFIG_PSI */
>> diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
>> new file mode 100644
>> index 000000000000..2ea9d7276b21
>> --- /dev/null
>> +++ b/kernel/sched/bpf_psi.c
>> @@ -0,0 +1,253 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * BPF PSI event handlers
>> + *
>> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
>> + */
>> +
>> +#include <linux/bpf_psi.h>
>> +#include <linux/cgroup-defs.h>
>> +
>> +static struct workqueue_struct *bpf_psi_wq;
>> +
>> +static struct bpf_psi *bpf_psi_create(struct bpf_psi_ops *ops)
>> +{
>> +       struct bpf_psi *bpf_psi;
>> +
>> +       bpf_psi = kzalloc(sizeof(*bpf_psi), GFP_KERNEL);
>> +       if (!bpf_psi)
>> +               return NULL;
>> +
>> +       if (init_srcu_struct(&bpf_psi->srcu)) {
>> +               kfree(bpf_psi);
>> +               return NULL;
>> +       }
>> +
>> +       spin_lock_init(&bpf_psi->lock);
>> +       bpf_psi->ops = ops;
>> +       INIT_LIST_HEAD(&bpf_psi->triggers);
>> +       ops->bpf_psi = bpf_psi;
>> +
>> +       return bpf_psi;
>> +}
>> +
>> +static void bpf_psi_free(struct bpf_psi *bpf_psi)
>> +{
>> +       cleanup_srcu_struct(&bpf_psi->srcu);
>> +       kfree(bpf_psi);
>> +}
>> +
>> +static void bpf_psi_handle_event_fn(struct work_struct *work)
>> +{
>> +       struct psi_trigger *t;
>> +       struct bpf_psi *bpf_psi;
>> +       int idx;
>> +
>> +       t = container_of(work, struct psi_trigger, bpf_work);
>> +       bpf_psi = READ_ONCE(t->bpf_psi);
>> +
>> +       if (likely(bpf_psi)) {
>> +               idx = srcu_read_lock(&bpf_psi->srcu);
>> +               if (bpf_psi->ops->handle_psi_event)
>> +                       bpf_psi->ops->handle_psi_event(t);
>> +               srcu_read_unlock(&bpf_psi->srcu, idx);
>> +       }
>> +}
>> +
>> +void bpf_psi_add_trigger(struct psi_trigger *t,
>> +                        const struct psi_trigger_params *params)
>> +{
>> +       t->bpf_psi = params->bpf_psi;
>> +       t->pinned = false;
>> +       INIT_WORK(&t->bpf_work, bpf_psi_handle_event_fn);
>> +
>> +       spin_lock(&t->bpf_psi->lock);
>> +       list_add(&t->bpf_psi_node, &t->bpf_psi->triggers);
>> +       spin_unlock(&t->bpf_psi->lock);
>> +
>> +       spin_lock(&t->group->bpf_triggers_lock);
>> +       list_add(&t->bpf_group_node, &t->group->bpf_triggers);
>> +       spin_unlock(&t->group->bpf_triggers_lock);
>> +}
>> +
>> +void bpf_psi_remove_trigger(struct psi_trigger *t)
>> +{
>> +       spin_lock(&t->group->bpf_triggers_lock);
>> +       list_del(&t->bpf_group_node);
>> +       spin_unlock(&t->group->bpf_triggers_lock);
>> +
>> +       spin_lock(&t->bpf_psi->lock);
>> +       list_del(&t->bpf_psi_node);
>> +       spin_unlock(&t->bpf_psi->lock);
>> +}
>> +
>> +#ifdef CONFIG_CGROUPS
>> +void bpf_psi_cgroup_free(struct cgroup *cgroup)
>> +{
>> +       struct psi_group *group = cgroup->psi;
>> +       u64 cgrp_id = cgroup_id(cgroup);
>> +       struct psi_trigger *t, *p;
>> +       struct bpf_psi *bpf_psi;
>> +       LIST_HEAD(to_destroy);
>> +       int idx;
>> +
>> +       spin_lock(&group->bpf_triggers_lock);
>> +       list_for_each_entry_safe(t, p, &group->bpf_triggers, bpf_group_node) {
>> +               if (!t->pinned) {
>> +                       t->pinned = true;
>> +                       list_move(&t->bpf_group_node, &to_destroy);
>> +               }
>> +       }
>> +       spin_unlock(&group->bpf_triggers_lock);
>> +
>> +       list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node) {
>> +               bpf_psi = READ_ONCE(t->bpf_psi);
>> +
>> +               idx = srcu_read_lock(&bpf_psi->srcu);
>> +               if (bpf_psi->ops->handle_cgroup_free)
>> +                       bpf_psi->ops->handle_cgroup_free(cgrp_id);
>> +               srcu_read_unlock(&bpf_psi->srcu, idx);
>> +
>> +               spin_lock(&bpf_psi->lock);
>> +               list_del(&t->bpf_psi_node);
>> +               spin_unlock(&bpf_psi->lock);
>> +
>> +               WRITE_ONCE(t->bpf_psi, NULL);
>> +               flush_workqueue(bpf_psi_wq);
>> +               synchronize_srcu(&bpf_psi->srcu);
>> +               psi_trigger_destroy(t);
>> +       }
>> +}
>> +#endif
>> +
>> +void bpf_psi_handle_event(struct psi_trigger *t)
>> +{
>> +       queue_work(bpf_psi_wq, &t->bpf_work);
>> +}
>> +
>> +// bpf struct ops
>
> C++ style comment?

Fixed.

>> +
>> +static int __bpf_psi_init(struct bpf_psi *bpf_psi) { return 0; }
>> +static void __bpf_psi_handle_psi_event(struct psi_trigger *t) {}
>> +static void __bpf_psi_handle_cgroup_free(u64 cgroup_id) {}
>> +
>> +static struct bpf_psi_ops __bpf_psi_ops = {
>> +       .init = __bpf_psi_init,
>> +       .handle_psi_event = __bpf_psi_handle_psi_event,
>> +       .handle_cgroup_free = __bpf_psi_handle_cgroup_free,
>> +};
>> +
>> +static const struct bpf_func_proto *
>> +bpf_psi_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> +{
>> +       return tracing_prog_func_proto(func_id, prog);
>> +}
>> +
>> +static bool bpf_psi_ops_is_valid_access(int off, int size,
>> +                                       enum bpf_access_type type,
>> +                                       const struct bpf_prog *prog,
>> +                                       struct bpf_insn_access_aux *info)
>> +{
>> +       return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
>> +}
>> +
>> +static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
>> +       .get_func_proto = bpf_psi_func_proto,
>> +       .is_valid_access = bpf_psi_ops_is_valid_access,
>> +};
>> +
>> +static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
>> +{
>> +       struct bpf_psi_ops *ops = kdata;
>> +       struct bpf_psi *bpf_psi;
>> +
>> +       bpf_psi = bpf_psi_create(ops);
>> +       if (!bpf_psi)
>> +               return -ENOMEM;
>> +
>> +       return ops->init(bpf_psi);
>> +}
>> +
>> +static void bpf_psi_ops_unreg(void *kdata, struct bpf_link *link)
>> +{
>> +       struct bpf_psi_ops *ops = kdata;
>> +       struct bpf_psi *bpf_psi = ops->bpf_psi;
>> +       struct psi_trigger *t, *p;
>> +       LIST_HEAD(to_destroy);
>> +
>> +       spin_lock(&bpf_psi->lock);
>> +       list_for_each_entry_safe(t, p, &bpf_psi->triggers, bpf_psi_node) {
>> +               spin_lock(&t->group->bpf_triggers_lock);
>> +               if (!t->pinned) {
>> +                       t->pinned = true;
>> +                       list_move(&t->bpf_group_node, &to_destroy);
>> +                       list_del(&t->bpf_psi_node);
>> +
>> +                       WRITE_ONCE(t->bpf_psi, NULL);
>> +               }
>> +               spin_unlock(&t->group->bpf_triggers_lock);
>> +       }
>> +       spin_unlock(&bpf_psi->lock);
>> +
>> +       flush_workqueue(bpf_psi_wq);
>> +       synchronize_srcu(&bpf_psi->srcu);
>> +
>> +       list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node)
>> +               psi_trigger_destroy(t);
>> +
>> +       bpf_psi_free(bpf_psi);
>> +}
>> +
>> +static int bpf_psi_ops_check_member(const struct btf_type *t,
>> +                                   const struct btf_member *member,
>> +                                   const struct bpf_prog *prog)
>> +{
>> +       return 0;
>> +}
>> +
>> +static int bpf_psi_ops_init_member(const struct btf_type *t,
>> +                                  const struct btf_member *member,
>> +                                  void *kdata, const void *udata)
>> +{
>> +       return 0;
>> +}
>> +
>> +static int bpf_psi_ops_init(struct btf *btf)
>> +{
>> +       return 0;
>> +}
>> +
>> +static struct bpf_struct_ops bpf_psi_bpf_ops = {
>> +       .verifier_ops = &bpf_psi_verifier_ops,
>> +       .reg = bpf_psi_ops_reg,
>> +       .unreg = bpf_psi_ops_unreg,
>> +       .check_member = bpf_psi_ops_check_member,
>> +       .init_member = bpf_psi_ops_init_member,
>> +       .init = bpf_psi_ops_init,
>> +       .name = "bpf_psi_ops",
>> +       .owner = THIS_MODULE,
>> +       .cfi_stubs = &__bpf_psi_ops
>> +};
>> +
>> +static int __init bpf_psi_struct_ops_init(void)
>> +{
>> +       int wq_flags = WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_HIGHPRI;
>> +       int err;
>> +
>> +       bpf_psi_wq = alloc_workqueue("bpf_psi_wq", wq_flags, 0);
>> +       if (!bpf_psi_wq)
>> +               return -ENOMEM;
>> +
>> +       err = register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops);
>> +       if (err) {
>> +               pr_warn("error while registering bpf psi struct ops: %d", err);
>> +               goto err;
>> +       }
>> +
>> +       return 0;
>> +
>> +err:
>> +       destroy_workqueue(bpf_psi_wq);
>> +       return err;
>> +}
>> +late_initcall(bpf_psi_struct_ops_init);
>> diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c
>> index bf9d8db94b70..80f3799a2fa6 100644
>> --- a/kernel/sched/build_utility.c
>> +++ b/kernel/sched/build_utility.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/sched/rseq_api.h>
>>  #include <linux/sched/task_stack.h>
>>
>> +#include <linux/bpf_psi.h>
>>  #include <linux/cpufreq.h>
>>  #include <linux/cpumask_api.h>
>>  #include <linux/cpuset.h>
>> @@ -92,6 +93,9 @@
>>
>>  #ifdef CONFIG_PSI
>>  # include "psi.c"
>> +# ifdef CONFIG_BPF_SYSCALL
>> +#  include "bpf_psi.c"
>> +# endif
>>  #endif
>>
>>  #ifdef CONFIG_MEMBARRIER
>> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
>> index e1d8eaeeff17..e10fbbc34099 100644
>> --- a/kernel/sched/psi.c
>> +++ b/kernel/sched/psi.c
>> @@ -201,6 +201,10 @@ static void group_init(struct psi_group *group)
>>         init_waitqueue_head(&group->rtpoll_wait);
>>         timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
>>         rcu_assign_pointer(group->rtpoll_task, NULL);
>> +#ifdef CONFIG_BPF_SYSCALL
>> +       spin_lock_init(&group->bpf_triggers_lock);
>> +       INIT_LIST_HEAD(&group->bpf_triggers);
>> +#endif
>>  }
>>
>>  void __init psi_init(void)
>> @@ -489,10 +493,17 @@ static void update_triggers(struct psi_group *group, u64 now,
>>
>>                 /* Generate an event */
>>                 if (cmpxchg(&t->event, 0, 1) == 0) {
>> -                       if (t->type == PSI_CGROUP)
>> -                               kernfs_notify(t->of->kn);
>> -                       else
>> +                       switch (t->type) {
>> +                       case PSI_SYSTEM:
>>                                 wake_up_interruptible(&t->event_wait);
>> +                               break;
>> +                       case PSI_CGROUP:
>> +                               kernfs_notify(t->of->kn);
>> +                               break;
>> +                       case PSI_BPF:
>> +                               bpf_psi_handle_event(t);
>> +                               break;
>> +                       }
>>                 }
>>                 t->last_event_time = now;
>>                 /* Reset threshold breach flag once event got generated */
>> @@ -1125,6 +1136,7 @@ void psi_cgroup_free(struct cgroup *cgroup)
>>                 return;
>>
>>         cancel_delayed_work_sync(&cgroup->psi->avgs_work);
>> +       bpf_psi_cgroup_free(cgroup);
>>         free_percpu(cgroup->psi->pcpu);
>>         /* All triggers must be removed by now */
>>         WARN_ONCE(cgroup->psi->rtpoll_states, "psi: trigger leak\n");
>> @@ -1356,6 +1368,9 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
>>         case PSI_CGROUP:
>>                 t->of = params->of;
>>                 break;
>> +       case PSI_BPF:
>> +               bpf_psi_add_trigger(t, params);
>> +               break;
>>         }
>>
>>         t->pending_event = false;
>> @@ -1369,8 +1384,10 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
>>
>>                         task = kthread_create(psi_rtpoll_worker, group, "psimon");
>>                         if (IS_ERR(task)) {
>> -                               kfree(t);
>>                                 mutex_unlock(&group->rtpoll_trigger_lock);
>> +                               if (t->type == PSI_BPF)
>> +                                       bpf_psi_remove_trigger(t);
>> +                               kfree(t);
>>                                 return ERR_CAST(task);
>>                         }
>>                         atomic_set(&group->rtpoll_wakeup, 0);
>> @@ -1414,10 +1431,16 @@ void psi_trigger_destroy(struct psi_trigger *t)
>
> Will this function ever be called for PSI_BPF triggers?

Yes, from bpf_psi_cgroup_free() and bpf_psi_ops_unreg().

> Same question for psi_trigger_poll().

No.

And btw thank you for reviewing the series, highly appreciated!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf
  2025-08-19 22:31     ` Roman Gushchin
@ 2025-08-19 23:31       ` Roman Gushchin
  2025-08-20 23:56         ` Suren Baghdasaryan
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-19 23:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Roman Gushchin <roman.gushchin@linux.dev> writes:

> Suren Baghdasaryan <surenb@google.com> writes:
>
>> On Mon, Aug 18, 2025 at 10:02 AM Roman Gushchin
>> <roman.gushchin@linux.dev> wrote:
>
>>
>>> +
>>> +       /* Cgroup Id */
>>> +       u64 cgroup_id;
>>
>> This cgroup_id field is weird. It's not initialized and not used here,
>> then it gets initialized in the next patch and used in the last patch
>> from a selftest. This is quite confusing. Also logically I don't think
>> a cgroup attribute really belongs to psi_trigger... Can we at least
>> move it into bpf_psi where it might fit a bit better?
>
> I can't move it to bpf_psi, because a single bpf_psi might own multiple
> triggers with different cgroup_id's.
> For sure I can move it to the next patch, if it's preferred.
>
> If you really don't like it here, other option is to replace it with
> a new bpf helper (kfunc) which calculates the cgroup_id by walking the
> trigger->group->cgroup->cgroup_id path each time.

Actually, there is no easy path from psi_group to cgroup, so that
option isn't really available, unfortunately, unless we add a back-link
from the psi_group to the cgroup.
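
If we went the back-link route, it would look roughly like this (entirely
hypothetical, nothing like this exists in the series): a pointer in
struct psi_group,

	/* back-link to the owning cgroup, NULL for the system-wide group */
	struct cgroup *cgroup;

plus a kfunc along the lines of

	__bpf_kfunc u64 bpf_psi_trigger_cgroup_id(struct psi_trigger *t)
	{
		return t->group->cgroup ? cgroup_id(t->group->cgroup) : 0;
	}

instead of caching cgroup_id in the trigger.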


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 02/14] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  2025-08-18 17:01 ` [PATCH v1 02/14] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
@ 2025-08-20  9:17   ` Kumar Kartikeya Dwivedi
  2025-08-20 22:32     ` Roman Gushchin
  0 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-20  9:17 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On Mon, 18 Aug 2025 at 19:01, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Struct oom_control is used to describe the OOM context.
> Its memcg field defines the scope of OOM: it's NULL for global
> OOMs and a valid memcg pointer for memcg-scoped OOMs.
> Teach bpf verifier to recognize it as trusted or NULL pointer.
> It will provide the bpf OOM handler a trusted memcg pointer,
> which for example is required for iterating the memcg's subtree.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers
  2025-08-18 17:01 ` [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers Roman Gushchin
@ 2025-08-20  9:21   ` Kumar Kartikeya Dwivedi
  2025-08-20 22:43     ` Roman Gushchin
  0 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-20  9:21 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> To effectively operate with memory cgroups in bpf there is a need
> to convert css pointers to memcg pointers. A simple container_of
> cast which is used in the kernel code can't be used in bpf because
> from the verifier's point of view that's an out-of-bounds memory access.
>
> Introduce helper get/put kfuncs which can be used to get
> a refcounted memcg pointer from the css pointer:
>   - bpf_get_mem_cgroup,
>   - bpf_put_mem_cgroup.
>
> bpf_get_mem_cgroup() can take both memcg's css and the corresponding
> cgroup's "self" css. It allows it to be used with the existing cgroup
> iterator which iterates over cgroup tree, not memcg tree.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  include/linux/memcontrol.h |   2 +
>  mm/Makefile                |   1 +
>  mm/bpf_memcontrol.c        | 151 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 154 insertions(+)
>  create mode 100644 mm/bpf_memcontrol.c
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 87b6688f124a..785a064000cd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -932,6 +932,8 @@ static inline void mod_memcg_page_state(struct page *page,
>         rcu_read_unlock();
>  }
>
> +unsigned long memcg_events(struct mem_cgroup *memcg, int event);
> +unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
>  unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
>  unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
>  unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> diff --git a/mm/Makefile b/mm/Makefile
> index a714aba03759..c397af904a87 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
>  endif
>  ifdef CONFIG_BPF_SYSCALL
>  obj-y += bpf_oom.o
> +obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
>  endif
>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>  obj-$(CONFIG_GUP_TEST) += gup_test.o
> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> new file mode 100644
> index 000000000000..66f2a359af7e
> --- /dev/null
> +++ b/mm/bpf_memcontrol.c
> @@ -0,0 +1,151 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Memory Controller-related BPF kfuncs and auxiliary code
> + *
> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> + */
> +
> +#include <linux/memcontrol.h>
> +#include <linux/bpf.h>
> +
> +__bpf_kfunc_start_defs();
> +
> +/**
> + * bpf_get_mem_cgroup - Get a reference to a memory cgroup
> + * @css: pointer to the css structure
> + *
> + * Returns a pointer to a mem_cgroup structure after bumping
> + * the corresponding css's reference counter.
> + *
> + * It's fine to pass a css which belongs to any cgroup controller,
> + * e.g. unified hierarchy's main css.
> + *
> + * Implements KF_ACQUIRE semantics.
> + */
> +__bpf_kfunc struct mem_cgroup *
> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
> +{
> +       struct mem_cgroup *memcg = NULL;
> +       bool rcu_unlock = false;
> +
> +       if (!root_mem_cgroup)
> +               return NULL;
> +
> +       if (root_mem_cgroup->css.ss != css->ss) {
> +               struct cgroup *cgroup = css->cgroup;
> +               int ssid = root_mem_cgroup->css.ss->id;
> +
> +               rcu_read_lock();
> +               rcu_unlock = true;
> +               css = rcu_dereference_raw(cgroup->subsys[ssid]);
> +       }
> +
> +       if (css && css_tryget(css))
> +               memcg = container_of(css, struct mem_cgroup, css);
> +
> +       if (rcu_unlock)
> +               rcu_read_unlock();
> +
> +       return memcg;
> +}
> +
> +/**
> + * bpf_put_mem_cgroup - Put a reference to a memory cgroup
> + * @memcg: memory cgroup to release
> + *
> + * Releases a previously acquired memcg reference.
> + * Implements KF_RELEASE semantics.
> + */
> +__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> +{
> +       css_put(&memcg->css);
> +}
> +
> +/**
> + * bpf_mem_cgroup_events - Read memory cgroup's event counter
> + * @memcg: memory cgroup
> + * @event: event idx
> + *
> + * Allows to read memory cgroup event counters.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_events(struct mem_cgroup *memcg, int event)
> +{
> +
> +       if (event < 0 || event >= NR_VM_EVENT_ITEMS)
> +               return (unsigned long)-1;
> +
> +       return memcg_events(memcg, event);
> +}
> +
> +/**
> + * bpf_mem_cgroup_usage - Read memory cgroup's usage
> + * @memcg: memory cgroup
> + *
> + * Returns current memory cgroup size in bytes.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
> +{
> +       return page_counter_read(&memcg->memory);
> +}
> +
> +/**
> + * bpf_mem_cgroup_events - Read memory cgroup's page state counter
> + * @memcg: memory cgroup
> + * @event: event idx
> + *
> + * Allows to read memory cgroup statistics.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
> +{
> +       if (idx < 0 || idx >= MEMCG_NR_STAT)
> +               return (unsigned long)-1;
> +
> +       return memcg_page_state(memcg, idx);
> +}
> +
> +/**
> + * bpf_mem_cgroup_flush_stats - Flush memory cgroup's statistics
> + * @memcg: memory cgroup
> + *
> + * Propagate memory cgroup's statistics up the cgroup tree.
> + *
> + * Note, that this function uses the rate-limited version of
> + * mem_cgroup_flush_stats() to avoid hurting the system-wide
> + * performance. So bpf_mem_cgroup_flush_stats() guarantees only
> + * that statistics is not stale beyond 2*FLUSH_TIME.
> + */
> +__bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> +{
> +       mem_cgroup_flush_stats_ratelimited(memcg);
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
> +BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)

I think you could set KF_TRUSTED_ARGS for this as well.
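
i.e. something like (illustrative only):

BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_TRUSTED_ARGS)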

> +BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
> +
> +BTF_ID_FLAGS(func, bpf_mem_cgroup_events, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_mem_cgroup_usage, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_TRUSTED_ARGS)
> +
> +BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
> +
> +static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
> +       .owner          = THIS_MODULE,
> +       .set            = &bpf_memcontrol_kfuncs,
> +};
> +
> +static int __init bpf_memcontrol_init(void)
> +{
> +       int err;
> +
> +       err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +                                       &bpf_memcontrol_kfunc_set);
> +       if (err)
> +               pr_warn("error while registering bpf memcontrol kfuncs: %d", err);
> +
> +       return err;
> +}
> +late_initcall(bpf_memcontrol_init);
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 05/14] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  2025-08-18 17:01 ` [PATCH v1 05/14] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Roman Gushchin
@ 2025-08-20  9:25   ` Kumar Kartikeya Dwivedi
  2025-08-20 22:45     ` Roman Gushchin
  0 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-20  9:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Introduce a bpf kfunc to get a trusted pointer to the root memory
> cgroup. It's very handy to traverse the full memcg tree, e.g.
> for handling a system-wide OOM.
>
> It's possible to obtain this pointer by traversing the memcg tree
> up from any known memcg, but it's sub-optimal and makes bpf programs
> more complex and less efficient.
>
> bpf_get_root_mem_cgroup() has a KF_ACQUIRE | KF_RET_NULL semantics,
> however in reality it's not necessary to bump the corresponding
> reference counter - root memory cgroup is immortal, reference counting
> is skipped, see css_get(). Once set, root_mem_cgroup is always a valid
> memcg pointer. It's safe to call bpf_put_mem_cgroup() for the pointer
> obtained with bpf_get_root_mem_cgroup(), it's effectively a no-op.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  mm/bpf_memcontrol.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> index 66f2a359af7e..a8faa561bcba 100644
> --- a/mm/bpf_memcontrol.c
> +++ b/mm/bpf_memcontrol.c
> @@ -10,6 +10,20 @@
>
>  __bpf_kfunc_start_defs();
>
> +/**
> + * bpf_get_root_mem_cgroup - Returns a pointer to the root memory cgroup
> + *
> + * The function has KF_ACQUIRE semantics, even though the root memory
> + * cgroup is never destroyed after being created and doesn't require
> + * reference counting. And it's perfectly safe to pass it to
> + * bpf_put_mem_cgroup()
> + */
> +__bpf_kfunc struct mem_cgroup *bpf_get_root_mem_cgroup(void)
> +{
> +       /* css_get() is not needed */
> +       return root_mem_cgroup;
> +}
> +
>  /**
>   * bpf_get_mem_cgroup - Get a reference to a memory cgroup
>   * @css: pointer to the css structure
> @@ -122,6 +136,7 @@ __bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>  __bpf_kfunc_end_defs();
>
>  BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
> +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)

Same suggestion here (re: trusted args).

>  BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
>
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 10/14] bpf: selftests: bpf OOM handler test
  2025-08-18 17:01 ` [PATCH v1 10/14] bpf: selftests: bpf OOM handler test Roman Gushchin
@ 2025-08-20  9:33   ` Kumar Kartikeya Dwivedi
  2025-08-20 22:49     ` Roman Gushchin
  2025-08-20 20:23   ` Andrii Nakryiko
  1 sibling, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-20  9:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Implement a pseudo-realistic test for the OOM handling
> functionality.
>
> The OOM handling policy which is implemented in bpf is to
> kill all tasks belonging to the biggest leaf cgroup, which
> doesn't contain unkillable tasks (tasks with oom_score_adj
> set to -1000). Pagecache size is excluded from the accounting.
>
> The test creates a hierarchy of memory cgroups, causes an
> OOM at the top level, checks that the expected process will be
> killed and checks memcg's oom statistics.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  [...]
> +
> +/*
> + * Find the largest leaf cgroup (ignoring page cache) without unkillable tasks
> + * and kill all belonging tasks.
> + */
> +SEC("struct_ops.s/handle_out_of_memory")
> +int BPF_PROG(test_out_of_memory, struct oom_control *oc)
> +{
> +       struct task_struct *task;
> +       struct mem_cgroup *root_memcg = oc->memcg;
> +       struct mem_cgroup *memcg, *victim = NULL;
> +       struct cgroup_subsys_state *css_pos;
> +       unsigned long usage, max_usage = 0;
> +       unsigned long pagecache = 0;
> +       int ret = 0;
> +
> +       if (root_memcg)
> +               root_memcg = bpf_get_mem_cgroup(&root_memcg->css);
> +       else
> +               root_memcg = bpf_get_root_mem_cgroup();
> +
> +       if (!root_memcg)
> +               return 0;
> +
> +       bpf_rcu_read_lock();
> +       bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
> +               if (css_pos->cgroup->nr_descendants + css_pos->cgroup->nr_dying_descendants)
> +                       continue;
> +
> +               memcg = bpf_get_mem_cgroup(css_pos);
> +               if (!memcg)
> +                       continue;
> +
> +               usage = bpf_mem_cgroup_usage(memcg);
> +               pagecache = bpf_mem_cgroup_page_state(memcg, NR_FILE_PAGES);
> +
> +               if (usage > pagecache)
> +                       usage -= pagecache;
> +               else
> +                       usage = 0;
> +
> +               if ((usage > max_usage) && mem_cgroup_killable(memcg)) {
> +                       max_usage = usage;
> +                       if (victim)
> +                               bpf_put_mem_cgroup(victim);
> +                       victim = bpf_get_mem_cgroup(&memcg->css);
> +               }
> +
> +               bpf_put_mem_cgroup(memcg);
> +       }
> +       bpf_rcu_read_unlock();
> +
> +       if (!victim)
> +               goto exit;
> +
> +       bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
> +               struct task_struct *t = bpf_task_acquire(task);
> +
> +               if (t) {
> +                       if (!bpf_task_is_oom_victim(task))
> +                               bpf_oom_kill_process(oc, task, "bpf oom test");

Is there a scenario where we want to invoke bpf_oom_kill_process when
the task is not an oom victim?
Would it be better to subsume this check in the kfunc itself?
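
E.g. something along these lines at the top of bpf_oom_kill_process()
(sketch only, the error code is just illustrative):

	if (tsk_is_oom_victim(task))
		return -EALREADY;

so BPF programs wouldn't need to call bpf_task_is_oom_victim() themselves.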

> +                       bpf_task_release(t);
> +                       ret = 1;
> +               }
> +       }
> +
> +       bpf_put_mem_cgroup(victim);
> +exit:
> +       bpf_put_mem_cgroup(root_memcg);
> +
> +       return ret;
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_oom_ops test_bpf_oom = {
> +       .name = "bpf_test_policy",
> +       .handle_out_of_memory = (void *)test_out_of_memory,
> +};
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 06/14] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-08-18 17:01 ` [PATCH v1 06/14] mm: introduce bpf_out_of_memory() " Roman Gushchin
  2025-08-19  4:09   ` Suren Baghdasaryan
@ 2025-08-20  9:34   ` Kumar Kartikeya Dwivedi
  2025-08-20 22:59     ` Roman Gushchin
  1 sibling, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-20  9:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Introduce bpf_out_of_memory() bpf kfunc, which allows declaring an
> out of memory event and triggering the corresponding kernel OOM
> handling mechanism.
>
> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> as an argument, as well as the page order.
>
> If the wait_on_oom_lock argument is not set, only one OOM can be
> declared and handled in the system at once, so if the function is
> called in parallel to another OOM handling, it bails out with -EBUSY.
> This mode is suited for global OOM's: any concurrent OOMs will likely
> do the job and release some memory. In a blocking mode (which is
> suited for memcg OOMs) the execution will wait on the oom_lock mutex.
>
> The function is declared as sleepable. It guarantees that it won't
> be called from an atomic context. It's required by the OOM handling
> code, which is not guaranteed to work in a non-blocking context.
>
> Handling of a memcg OOM almost always requires taking of the
> css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable
> also guarantees that it can't be called with acquired css_set_lock,
> so the kernel can't deadlock on it.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  mm/oom_kill.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25fc5e744e27..df409f0fac45 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1324,10 +1324,55 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>         return 0;
>  }
>
> +/**
> + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
> + * @memcg__nullable: memcg or NULL for system-wide OOMs
> + * @order: order of page which wasn't allocated
> + * @wait_on_oom_lock: if true, block on oom_lock
> + * @constraint_text__nullable: custom constraint description for the OOM report
> + *
> + * Declares the Out Of Memory state and invokes the OOM killer.
> + *
> + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
> + * is true, the function will wait on it. Otherwise it bails out with -EBUSY
> + * if oom_lock is contended.
> + *
> + * Generally it's advised to pass wait_on_oom_lock=true for global OOMs
> + * and wait_on_oom_lock=false for memcg-scoped OOMs.
> + *
> + * Returns 1 if the forward progress was achieved and some memory was freed.
> + * Returns a negative value if an error has been occurred.
> + */
> +__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> +                                 int order, bool wait_on_oom_lock)

I think this bool should be a u64 flags instead, just to make it
easier to extend behavior in the future.
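
Something like this, perhaps (the flag name is purely illustrative):

#define BPF_OOM_F_WAIT_ON_LOCK	(1ULL << 0)

__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
				  int order, u64 flags);

with an early

	if (flags & ~BPF_OOM_F_WAIT_ON_LOCK)
		return -EINVAL;

in the body, and the rest keyed off flags & BPF_OOM_F_WAIT_ON_LOCK instead
of the bool. That way unknown flags are rejected and new behavior can be
added later without changing the kfunc signature.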

> +{
> +       struct oom_control oc = {
> +               .memcg = memcg__nullable,
> +               .order = order,
> +       };
> +       int ret;
> +
> +       if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
> +               return -EINVAL;
> +
> +       if (wait_on_oom_lock) {
> +               ret = mutex_lock_killable(&oom_lock);
> +               if (ret)
> +                       return ret;
> +       } else if (!mutex_trylock(&oom_lock))
> +               return -EBUSY;
> +
> +       ret = out_of_memory(&oc);
> +
> +       mutex_unlock(&oom_lock);
> +       return ret;
> +}
> +
>  __bpf_kfunc_end_defs();
>
>  BTF_KFUNCS_START(bpf_oom_kfuncs)
>  BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS)
>  BTF_KFUNCS_END(bpf_oom_kfuncs)
>
>  static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-18 17:01 ` [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling Roman Gushchin
  2025-08-19  4:09   ` Suren Baghdasaryan
@ 2025-08-20 11:28   ` Kumar Kartikeya Dwivedi
  2025-08-21  0:24     ` Roman Gushchin
  2025-08-26 16:56   ` Amery Hung
  2 siblings, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-20 11:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On Mon, 18 Aug 2025 at 19:01, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Introduce a bpf struct ops for implementing custom OOM handling policies.
>
> The struct ops provides the bpf_handle_out_of_memory() callback,
> which is expected to return 1 if it was able to free some memory and 0
> otherwise.
>
> In the latter case it's guaranteed that the in-kernel OOM killer will
> be invoked. Otherwise the kernel also checks the bpf_memory_freed
> field of the oom_control structure, which is expected to be set by
> kfuncs suitable for releasing memory. It's a safety mechanism which
> prevents a bpf program to claim forward progress without actually
> releasing memory. The callback program is sleepable to enable using
> iterators, e.g. cgroup iterators.
>
> The callback receives struct oom_control as an argument, so it can
> easily filter out OOM's it doesn't want to handle, e.g. global vs
> memcg OOM's.
>
> The callback is executed just before the kernel victim task selection
> algorithm, so all heuristics and sysctls like panic on oom,
> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
> are respected.
>
> The struct ops also has the name field, which allows to define a
> custom name for the implemented policy. It's printed in the OOM report
> in the oom_policy=<policy> format. "default" is printed if bpf is not
> used or policy name is not specified.
>
> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>                oom_policy=bpf_test_policy
> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> [  112.698167] Call Trace:
> [  112.698177]  <TASK>
> [  112.698182]  dump_stack_lvl+0x4d/0x70
> [  112.698192]  dump_header+0x59/0x1c6
> [  112.698199]  oom_kill_process.cold+0x8/0xef
> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
> [  112.698250]  out_of_memory+0xab/0x5c0
> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
> [  112.698288]  charge_memcg+0x2f/0xc0
> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
> [  112.698299]  do_anonymous_page+0x40f/0xa50
> [  112.698311]  __handle_mm_fault+0xbba/0x1140
> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  112.698335]  handle_mm_fault+0xe6/0x370
> [  112.698343]  do_user_addr_fault+0x211/0x6a0
> [  112.698354]  exc_page_fault+0x75/0x1d0
> [  112.698363]  asm_exc_page_fault+0x26/0x30
> [  112.698366] RIP: 0033:0x7fa97236db00
>
> It's possible to load multiple bpf struct programs. In the case of
> oom, they will be executed one by one in the same order they been
> loaded until one of them returns 1 and bpf_memory_freed is set to 1
> - an indication that the memory was freed. This allows to have
> multiple bpf programs to focus on different types of OOM's - e.g.
> one program can only handle memcg OOM's in one memory cgroup.
> But the filtering is done in bpf - so it's fully flexible.

I think a natural question here is ordering. Is this ability to have
multiple OOM programs critical right now?
How is it decided who gets to run before the other? Is it based on
order of attachment (which can be non-deterministic)?
There was a lot of discussion on something similar for tc progs, and
we went with specific flags that capture partial ordering constraints
(instead of priorities that may collide).
https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
It would be nice if we can find a way of making this consistent.

Another option is to exclude the multiple attachment bit from the
initial version and do this as a follow up, since it probably requires
more discussion.

>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---

> [...]


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-19 20:06     ` Roman Gushchin
@ 2025-08-20 19:34       ` Suren Baghdasaryan
  2025-08-20 19:52         ` Roman Gushchin
  2025-08-26 16:23         ` Amery Hung
  0 siblings, 2 replies; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-20 19:34 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Tue, Aug 19, 2025 at 1:06 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Suren Baghdasaryan <surenb@google.com> writes:
>
> > On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> >>
> >> Introduce a bpf struct ops for implementing custom OOM handling policies.
> >>
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which is expected to return 1 if it was able to free some memory and 0
> >> otherwise.
> >>
> >> In the latter case it's guaranteed that the in-kernel OOM killer will
> >> be invoked. Otherwise the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory. It's a safety mechanism which
> >> prevents a bpf program from claiming forward progress without actually
> >> releasing memory. The callback program is sleepable to enable using
> >> iterators, e.g. cgroup iterators.
> >>
> >> The callback receives struct oom_control as an argument, so it can
> >> easily filter out OOM's it doesn't want to handle, e.g. global vs
> >> memcg OOM's.
> >>
> >> The callback is executed just before the kernel victim task selection
> >> algorithm, so all heuristics and sysctls like panic on oom and
> >> sysctl_oom_kill_allocating_task are respected.
> >>
> >> The struct ops also has the name field, which allows defining a
> >> custom name for the implemented policy. It's printed in the OOM report
> >> in the oom_policy=<policy> format. "default" is printed if bpf is not
> >> used or policy name is not specified.
> >>
> >> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> >>                oom_policy=bpf_test_policy
> >> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> >> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> >> [  112.698167] Call Trace:
> >> [  112.698177]  <TASK>
> >> [  112.698182]  dump_stack_lvl+0x4d/0x70
> >> [  112.698192]  dump_header+0x59/0x1c6
> >> [  112.698199]  oom_kill_process.cold+0x8/0xef
> >> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
> >> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> >> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> >> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
> >> [  112.698250]  out_of_memory+0xab/0x5c0
> >> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
> >> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
> >> [  112.698288]  charge_memcg+0x2f/0xc0
> >> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
> >> [  112.698299]  do_anonymous_page+0x40f/0xa50
> >> [  112.698311]  __handle_mm_fault+0xbba/0x1140
> >> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> [  112.698335]  handle_mm_fault+0xe6/0x370
> >> [  112.698343]  do_user_addr_fault+0x211/0x6a0
> >> [  112.698354]  exc_page_fault+0x75/0x1d0
> >> [  112.698363]  asm_exc_page_fault+0x26/0x30
> >> [  112.698366] RIP: 0033:0x7fa97236db00
> >>
> >> It's possible to load multiple bpf struct ops programs. In the case of
> >> oom, they will be executed one by one in the same order they were
> >> loaded until one of them returns 1 and bpf_memory_freed is set to 1
> >> - an indication that the memory was freed. This allows multiple
> >> bpf programs to focus on different types of OOM's - e.g.
> >> one program can only handle memcg OOM's in one memory cgroup.
> >> But the filtering is done in bpf - so it's fully flexible.
> >>
> >> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> >> ---
> >>  include/linux/bpf_oom.h |  49 +++++++++++++
> >>  include/linux/oom.h     |   8 ++
> >>  mm/Makefile             |   3 +
> >>  mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
> >>  mm/oom_kill.c           |  22 +++++-
> >>  5 files changed, 237 insertions(+), 2 deletions(-)
> >>  create mode 100644 include/linux/bpf_oom.h
> >>  create mode 100644 mm/bpf_oom.c
> >>
> >> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
> >> new file mode 100644
> >> index 000000000000..29cb5ea41d97
> >> --- /dev/null
> >> +++ b/include/linux/bpf_oom.h
> >> @@ -0,0 +1,49 @@
> >> +/* SPDX-License-Identifier: GPL-2.0+ */
> >> +
> >> +#ifndef __BPF_OOM_H
> >> +#define __BPF_OOM_H
> >> +
> >> +struct bpf_oom;
> >> +struct oom_control;
> >> +
> >> +#define BPF_OOM_NAME_MAX_LEN 64
> >> +
> >> +struct bpf_oom_ops {
> >> +       /**
> >> +        * @handle_out_of_memory: Out of memory bpf handler, called before
> >> +        * the in-kernel OOM killer.
> >> +        * @oc: OOM control structure
> >> +        *
> >> +        * Should return 1 if some memory was freed up, otherwise
> >> +        * the in-kernel OOM killer is invoked.
> >> +        */
> >> +       int (*handle_out_of_memory)(struct oom_control *oc);
> >> +
> >> +       /**
> >> +        * @name: BPF OOM policy name
> >> +        */
> >> +       char name[BPF_OOM_NAME_MAX_LEN];
> >
> > Why should the name be a part of ops structure? IMO it's not an
> > attribute of the operations but rather of the oom handler which is
> > represented by bpf_oom here.
>
> The ops structure describes a user-defined oom policy. Currently
> it's just one handler and the policy name. Later additional handlers
> can be added, e.g. a handler to control the dmesg output.
>
> bpf_oom is an implementation detail: it's basically an extension
> to struct bpf_oom_ops which contains "private" fields required
> for the internal machinery.

Ok. I hope we can come up with some more descriptive naming but I
can't think of something good ATM.

>
> >
> >> +
> >> +       /* Private */
> >> +       struct bpf_oom *bpf_oom;
> >> +};
> >> +
> >> +#ifdef CONFIG_BPF_SYSCALL
> >> +/**
> >> + * @bpf_handle_oom: handle out of memory using bpf programs
> >> + * @oc: OOM control structure
> >> + *
> >> + * Returns true if a bpf oom program was executed, returned 1
> >> + * and some memory was actually freed.
> >
> > The above comment is unclear, please clarify.
>
> Fixed, thanks.
>
> /**
>  * @bpf_handle_oom: handle out of memory condition using bpf
>  * @oc: OOM control structure
>  *
>  * Returns true if some memory was freed.
>  */
> bool bpf_handle_oom(struct oom_control *oc);
>
>
> >
> >> + */
> >> +bool bpf_handle_oom(struct oom_control *oc);
> >> +
> >> +#else /* CONFIG_BPF_SYSCALL */
> >> +static inline bool bpf_handle_oom(struct oom_control *oc)
> >> +{
> >> +       return false;
> >> +}
> >> +
> >> +#endif /* CONFIG_BPF_SYSCALL */
> >> +
> >> +#endif /* __BPF_OOM_H */
> >> diff --git a/include/linux/oom.h b/include/linux/oom.h
> >> index 1e0fc6931ce9..ef453309b7ea 100644
> >> --- a/include/linux/oom.h
> >> +++ b/include/linux/oom.h
> >> @@ -51,6 +51,14 @@ struct oom_control {
> >>
> >>         /* Used to print the constraint info. */
> >>         enum oom_constraint constraint;
> >> +
> >> +#ifdef CONFIG_BPF_SYSCALL
> >> +       /* Used by the bpf oom implementation to mark the forward progress */
> >> +       bool bpf_memory_freed;
> >> +
> >> +       /* Policy name */
> >> +       const char *bpf_policy_name;
> >> +#endif
> >>  };
> >>
> >>  extern struct mutex oom_lock;
> >> diff --git a/mm/Makefile b/mm/Makefile
> >> index 1a7a11d4933d..a714aba03759 100644
> >> --- a/mm/Makefile
> >> +++ b/mm/Makefile
> >> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> >>  ifdef CONFIG_SWAP
> >>  obj-$(CONFIG_MEMCG) += swap_cgroup.o
> >>  endif
> >> +ifdef CONFIG_BPF_SYSCALL
> >> +obj-y += bpf_oom.o
> >> +endif
> >>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> >>  obj-$(CONFIG_GUP_TEST) += gup_test.o
> >>  obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
> >> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
> >> new file mode 100644
> >> index 000000000000..47633046819c
> >> --- /dev/null
> >> +++ b/mm/bpf_oom.c
> >> @@ -0,0 +1,157 @@
> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> +/*
> >> + * BPF-driven OOM killer customization
> >> + *
> >> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> >> + */
> >> +
> >> +#include <linux/bpf.h>
> >> +#include <linux/oom.h>
> >> +#include <linux/bpf_oom.h>
> >> +#include <linux/srcu.h>
> >> +
> >> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
> >> +static DEFINE_SPINLOCK(bpf_oom_lock);
> >> +static LIST_HEAD(bpf_oom_handlers);
> >> +
> >> +struct bpf_oom {
> >
> > Perhaps bpf_oom_handler ? Then bpf_oom_ops->bpf_oom could be called
> > bpf_oom_ops->handler.
>
> I don't think it's a handler, it's more like a private part
> of bpf_oom_ops. Maybe bpf_oom_impl? Idk

Yeah, we need to come up with some nomenclature and name these structs
accordingly. In my mind ops means a structure that contains only
operations, so current naming does not sit well but maybe that's just
me...

>
> >
> >
> >> +       struct bpf_oom_ops *ops;
> >> +       struct list_head node;
> >> +       struct srcu_struct srcu;
> >> +};
> >> +
> >> +bool bpf_handle_oom(struct oom_control *oc)
> >> +{
> >> +       struct bpf_oom_ops *ops;
> >> +       struct bpf_oom *bpf_oom;
> >> +       int list_idx, idx, ret = 0;
> >> +
> >> +       oc->bpf_memory_freed = false;
> >> +
> >> +       list_idx = srcu_read_lock(&bpf_oom_srcu);
> >> +       list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
> >> +               ops = READ_ONCE(bpf_oom->ops);
> >> +               if (!ops || !ops->handle_out_of_memory)
> >> +                       continue;
> >> +               idx = srcu_read_lock(&bpf_oom->srcu);
> >> +               oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
> >> +                       "bpf_defined_policy";
> >> +               ret = ops->handle_out_of_memory(oc);
> >> +               oc->bpf_policy_name = NULL;
> >> +               srcu_read_unlock(&bpf_oom->srcu, idx);
> >> +
> >> +               if (ret && oc->bpf_memory_freed)
> >
> > IIUC ret and oc->bpf_memory_freed seem to reflect the same state:
> > handler successfully freed some memory. Could you please clarify when
> > they differ?
>
> The idea here is to provide an additional safety measure:
> if the bpf program simply returns 1 without doing anything,
> the system won't deadlock.
>
> oc->bpf_memory_freed is set by the bpf_oom_kill_process() helper
> (and potentially some other helpers in the future, e.g.
> bpf_oom_rm_tmpfs_file()) and can't be modified by the bpf
> program directly.

I see. Then maybe we use only oc->bpf_memory_freed and
handle_out_of_memory() does not return anything?


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-20 19:34       ` Suren Baghdasaryan
@ 2025-08-20 19:52         ` Roman Gushchin
  2025-08-20 20:01           ` Suren Baghdasaryan
  2025-08-26 16:23         ` Amery Hung
  1 sibling, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-20 19:52 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Suren Baghdasaryan <surenb@google.com> writes:

> On Tue, Aug 19, 2025 at 1:06 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Suren Baghdasaryan <surenb@google.com> writes:
>>
>> > On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
>> > <roman.gushchin@linux.dev> wrote:
>> >>
>> >> Introduce a bpf struct ops for implementing custom OOM handling policies.
>> >>
>> >> The struct ops provides the bpf_handle_out_of_memory() callback,
>> >> which is expected to return 1 if it was able to free some memory and 0
>> >> otherwise.
>> >>
>> >> In the latter case it's guaranteed that the in-kernel OOM killer will
>> >> be invoked. Otherwise the kernel also checks the bpf_memory_freed
>> >> field of the oom_control structure, which is expected to be set by
>> >> kfuncs suitable for releasing memory. It's a safety mechanism which
>> >> prevents a bpf program from claiming forward progress without actually
>> >> releasing memory. The callback program is sleepable to enable using
>> >> iterators, e.g. cgroup iterators.
>> >>
>> >> The callback receives struct oom_control as an argument, so it can
>> >> easily filter out OOM's it doesn't want to handle, e.g. global vs
>> >> memcg OOM's.
>> >>
>> >> The callback is executed just before the kernel victim task selection
>> >> algorithm, so all heuristics and sysctls like panic on oom and
>> >> sysctl_oom_kill_allocating_task are respected.
>> >>
>> >> The struct ops also has the name field, which allows defining a
>> >> custom name for the implemented policy. It's printed in the OOM report
>> >> in the oom_policy=<policy> format. "default" is printed if bpf is not
>> >> used or policy name is not specified.
>> >>
>> >> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>> >>                oom_policy=bpf_test_policy
>> >> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
>> >> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
>> >> [  112.698167] Call Trace:
>> >> [  112.698177]  <TASK>
>> >> [  112.698182]  dump_stack_lvl+0x4d/0x70
>> >> [  112.698192]  dump_header+0x59/0x1c6
>> >> [  112.698199]  oom_kill_process.cold+0x8/0xef
>> >> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
>> >> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
>> >> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
>> >> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
>> >> [  112.698250]  out_of_memory+0xab/0x5c0
>> >> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
>> >> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
>> >> [  112.698288]  charge_memcg+0x2f/0xc0
>> >> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
>> >> [  112.698299]  do_anonymous_page+0x40f/0xa50
>> >> [  112.698311]  __handle_mm_fault+0xbba/0x1140
>> >> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [  112.698335]  handle_mm_fault+0xe6/0x370
>> >> [  112.698343]  do_user_addr_fault+0x211/0x6a0
>> >> [  112.698354]  exc_page_fault+0x75/0x1d0
>> >> [  112.698363]  asm_exc_page_fault+0x26/0x30
>> >> [  112.698366] RIP: 0033:0x7fa97236db00
>> >>
>> >> It's possible to load multiple bpf struct ops programs. In the case of
>> >> oom, they will be executed one by one in the same order they were
>> >> loaded until one of them returns 1 and bpf_memory_freed is set to 1
>> >> - an indication that the memory was freed. This allows multiple
>> >> bpf programs to focus on different types of OOM's - e.g.
>> >> one program can only handle memcg OOM's in one memory cgroup.
>> >> But the filtering is done in bpf - so it's fully flexible.
>> >>
>> >> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> >> ---
>> >>  include/linux/bpf_oom.h |  49 +++++++++++++
>> >>  include/linux/oom.h     |   8 ++
>> >>  mm/Makefile             |   3 +
>> >>  mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
>> >>  mm/oom_kill.c           |  22 +++++-
>> >>  5 files changed, 237 insertions(+), 2 deletions(-)
>> >>  create mode 100644 include/linux/bpf_oom.h
>> >>  create mode 100644 mm/bpf_oom.c
>> >>
>> >> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
>> >> new file mode 100644
>> >> index 000000000000..29cb5ea41d97
>> >> --- /dev/null
>> >> +++ b/include/linux/bpf_oom.h
>> >> @@ -0,0 +1,49 @@
>> >> +/* SPDX-License-Identifier: GPL-2.0+ */
>> >> +
>> >> +#ifndef __BPF_OOM_H
>> >> +#define __BPF_OOM_H
>> >> +
>> >> +struct bpf_oom;
>> >> +struct oom_control;
>> >> +
>> >> +#define BPF_OOM_NAME_MAX_LEN 64
>> >> +
>> >> +struct bpf_oom_ops {
>> >> +       /**
>> >> +        * @handle_out_of_memory: Out of memory bpf handler, called before
>> >> +        * the in-kernel OOM killer.
>> >> +        * @oc: OOM control structure
>> >> +        *
>> >> +        * Should return 1 if some memory was freed up, otherwise
>> >> +        * the in-kernel OOM killer is invoked.
>> >> +        */
>> >> +       int (*handle_out_of_memory)(struct oom_control *oc);
>> >> +
>> >> +       /**
>> >> +        * @name: BPF OOM policy name
>> >> +        */
>> >> +       char name[BPF_OOM_NAME_MAX_LEN];
>> >
>> > Why should the name be a part of ops structure? IMO it's not an
>> > attribute of the operations but rather of the oom handler which is
>> > represented by bpf_oom here.
>>
>> The ops structure describes a user-defined oom policy. Currently
>> it's just one handler and the policy name. Later additional handlers
>> can be added, e.g. a handler to control the dmesg output.
>>
>> bpf_oom is an implementation detail: it's basically an extension
>> to struct bpf_oom_ops which contains "private" fields required
>> for the internal machinery.
>
> Ok. I hope we can come up with some more descriptive naming but I
> can't think of something good ATM.
>
>>
>> >
>> >> +
>> >> +       /* Private */
>> >> +       struct bpf_oom *bpf_oom;
>> >> +};
>> >> +
>> >> +#ifdef CONFIG_BPF_SYSCALL
>> >> +/**
>> >> + * @bpf_handle_oom: handle out of memory using bpf programs
>> >> + * @oc: OOM control structure
>> >> + *
>> >> + * Returns true if a bpf oom program was executed, returned 1
>> >> + * and some memory was actually freed.
>> >
>> > The above comment is unclear, please clarify.
>>
>> Fixed, thanks.
>>
>> /**
>>  * @bpf_handle_oom: handle out of memory condition using bpf
>>  * @oc: OOM control structure
>>  *
>>  * Returns true if some memory was freed.
>>  */
>> bool bpf_handle_oom(struct oom_control *oc);
>>
>>
>> >
>> >> + */
>> >> +bool bpf_handle_oom(struct oom_control *oc);
>> >> +
>> >> +#else /* CONFIG_BPF_SYSCALL */
>> >> +static inline bool bpf_handle_oom(struct oom_control *oc)
>> >> +{
>> >> +       return false;
>> >> +}
>> >> +
>> >> +#endif /* CONFIG_BPF_SYSCALL */
>> >> +
>> >> +#endif /* __BPF_OOM_H */
>> >> diff --git a/include/linux/oom.h b/include/linux/oom.h
>> >> index 1e0fc6931ce9..ef453309b7ea 100644
>> >> --- a/include/linux/oom.h
>> >> +++ b/include/linux/oom.h
>> >> @@ -51,6 +51,14 @@ struct oom_control {
>> >>
>> >>         /* Used to print the constraint info. */
>> >>         enum oom_constraint constraint;
>> >> +
>> >> +#ifdef CONFIG_BPF_SYSCALL
>> >> +       /* Used by the bpf oom implementation to mark the forward progress */
>> >> +       bool bpf_memory_freed;
>> >> +
>> >> +       /* Policy name */
>> >> +       const char *bpf_policy_name;
>> >> +#endif
>> >>  };
>> >>
>> >>  extern struct mutex oom_lock;
>> >> diff --git a/mm/Makefile b/mm/Makefile
>> >> index 1a7a11d4933d..a714aba03759 100644
>> >> --- a/mm/Makefile
>> >> +++ b/mm/Makefile
>> >> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>> >>  ifdef CONFIG_SWAP
>> >>  obj-$(CONFIG_MEMCG) += swap_cgroup.o
>> >>  endif
>> >> +ifdef CONFIG_BPF_SYSCALL
>> >> +obj-y += bpf_oom.o
>> >> +endif
>> >>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>> >>  obj-$(CONFIG_GUP_TEST) += gup_test.o
>> >>  obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
>> >> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
>> >> new file mode 100644
>> >> index 000000000000..47633046819c
>> >> --- /dev/null
>> >> +++ b/mm/bpf_oom.c
>> >> @@ -0,0 +1,157 @@
>> >> +// SPDX-License-Identifier: GPL-2.0-or-later
>> >> +/*
>> >> + * BPF-driven OOM killer customization
>> >> + *
>> >> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
>> >> + */
>> >> +
>> >> +#include <linux/bpf.h>
>> >> +#include <linux/oom.h>
>> >> +#include <linux/bpf_oom.h>
>> >> +#include <linux/srcu.h>
>> >> +
>> >> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
>> >> +static DEFINE_SPINLOCK(bpf_oom_lock);
>> >> +static LIST_HEAD(bpf_oom_handlers);
>> >> +
>> >> +struct bpf_oom {
>> >
>> > Perhaps bpf_oom_handler ? Then bpf_oom_ops->bpf_oom could be called
>> > bpf_oom_ops->handler.
>>
>> I don't think it's a handler, it's more like a private part
>> of bpf_oom_ops. Maybe bpf_oom_impl? Idk
>
> Yeah, we need to come up with some nomenclature and name these structs
> accordingly. In my mind ops means a structure that contains only
> operations, so current naming does not sit well but maybe that's just
> me...
>
>>
>> >
>> >
>> >> +       struct bpf_oom_ops *ops;
>> >> +       struct list_head node;
>> >> +       struct srcu_struct srcu;
>> >> +};
>> >> +
>> >> +bool bpf_handle_oom(struct oom_control *oc)
>> >> +{
>> >> +       struct bpf_oom_ops *ops;
>> >> +       struct bpf_oom *bpf_oom;
>> >> +       int list_idx, idx, ret = 0;
>> >> +
>> >> +       oc->bpf_memory_freed = false;
>> >> +
>> >> +       list_idx = srcu_read_lock(&bpf_oom_srcu);
>> >> +       list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
>> >> +               ops = READ_ONCE(bpf_oom->ops);
>> >> +               if (!ops || !ops->handle_out_of_memory)
>> >> +                       continue;
>> >> +               idx = srcu_read_lock(&bpf_oom->srcu);
>> >> +               oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
>> >> +                       "bpf_defined_policy";
>> >> +               ret = ops->handle_out_of_memory(oc);
>> >> +               oc->bpf_policy_name = NULL;
>> >> +               srcu_read_unlock(&bpf_oom->srcu, idx);
>> >> +
>> >> +               if (ret && oc->bpf_memory_freed)
>> >
>> > IIUC ret and oc->bpf_memory_freed seem to reflect the same state:
>> > handler successfully freed some memory. Could you please clarify when
>> > they differ?
>>
>> The idea here is to provide an additional safety measure:
>> if the bpf program simply returns 1 without doing anything,
>> the system won't deadlock.
>>
>> oc->bpf_memory_freed is set by the bpf_oom_kill_process() helper
>> (and potentially some other helpers in the future, e.g.
>> bpf_oom_rm_tmpfs_file()) and can't be modified by the bpf
>> program directly.
>
> I see. Then maybe we use only oc->bpf_memory_freed and
> handle_out_of_memory() does not return anything?

Idk, I think it's neat to have the ability to pass control to the in-kernel
OOM killer even after killing a task.
Also, I believe bpf programs have to return an int anyway,
so we can ignore it, but I don't necessarily see the point.



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-20 19:52         ` Roman Gushchin
@ 2025-08-20 20:01           ` Suren Baghdasaryan
  0 siblings, 0 replies; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-20 20:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Wed, Aug 20, 2025 at 12:53 PM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Suren Baghdasaryan <surenb@google.com> writes:
>
> > On Tue, Aug 19, 2025 at 1:06 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >>
> >> Suren Baghdasaryan <surenb@google.com> writes:
> >>
> >> > On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
> >> > <roman.gushchin@linux.dev> wrote:
> >> >>
> >> >> Introduce a bpf struct ops for implementing custom OOM handling policies.
> >> >>
> >> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> >> which is expected to return 1 if it was able to free some memory and 0
> >> >> otherwise.
> >> >>
> >> >> In the latter case it's guaranteed that the in-kernel OOM killer will
> >> >> be invoked. Otherwise the kernel also checks the bpf_memory_freed
> >> >> field of the oom_control structure, which is expected to be set by
> >> >> kfuncs suitable for releasing memory. It's a safety mechanism which
> >> >> prevents a bpf program from claiming forward progress without actually
> >> >> releasing memory. The callback program is sleepable to enable using
> >> >> iterators, e.g. cgroup iterators.
> >> >>
> >> >> The callback receives struct oom_control as an argument, so it can
> >> >> easily filter out OOM's it doesn't want to handle, e.g. global vs
> >> >> memcg OOM's.
> >> >>
> >> >> The callback is executed just before the kernel victim task selection
> >> >> algorithm, so all heuristics and sysctls like panic on oom and
> >> >> sysctl_oom_kill_allocating_task are respected.
> >> >>
> >> >> The struct ops also has the name field, which allows defining a
> >> >> custom name for the implemented policy. It's printed in the OOM report
> >> >> in the oom_policy=<policy> format. "default" is printed if bpf is not
> >> >> used or policy name is not specified.
> >> >>
> >> >> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> >> >>                oom_policy=bpf_test_policy
> >> >> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> >> >> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> >> >> [  112.698167] Call Trace:
> >> >> [  112.698177]  <TASK>
> >> >> [  112.698182]  dump_stack_lvl+0x4d/0x70
> >> >> [  112.698192]  dump_header+0x59/0x1c6
> >> >> [  112.698199]  oom_kill_process.cold+0x8/0xef
> >> >> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
> >> >> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> >> >> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> >> >> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> >> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
> >> >> [  112.698250]  out_of_memory+0xab/0x5c0
> >> >> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
> >> >> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
> >> >> [  112.698288]  charge_memcg+0x2f/0xc0
> >> >> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
> >> >> [  112.698299]  do_anonymous_page+0x40f/0xa50
> >> >> [  112.698311]  __handle_mm_fault+0xbba/0x1140
> >> >> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> >> [  112.698335]  handle_mm_fault+0xe6/0x370
> >> >> [  112.698343]  do_user_addr_fault+0x211/0x6a0
> >> >> [  112.698354]  exc_page_fault+0x75/0x1d0
> >> >> [  112.698363]  asm_exc_page_fault+0x26/0x30
> >> >> [  112.698366] RIP: 0033:0x7fa97236db00
> >> >>
> >> >> It's possible to load multiple bpf struct ops programs. In the case of
> >> >> oom, they will be executed one by one in the same order they were
> >> >> loaded until one of them returns 1 and bpf_memory_freed is set to 1
> >> >> - an indication that the memory was freed. This allows multiple
> >> >> bpf programs to focus on different types of OOM's - e.g.
> >> >> one program can only handle memcg OOM's in one memory cgroup.
> >> >> But the filtering is done in bpf - so it's fully flexible.
> >> >>
> >> >> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> >> >> ---
> >> >>  include/linux/bpf_oom.h |  49 +++++++++++++
> >> >>  include/linux/oom.h     |   8 ++
> >> >>  mm/Makefile             |   3 +
> >> >>  mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
> >> >>  mm/oom_kill.c           |  22 +++++-
> >> >>  5 files changed, 237 insertions(+), 2 deletions(-)
> >> >>  create mode 100644 include/linux/bpf_oom.h
> >> >>  create mode 100644 mm/bpf_oom.c
> >> >>
> >> >> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
> >> >> new file mode 100644
> >> >> index 000000000000..29cb5ea41d97
> >> >> --- /dev/null
> >> >> +++ b/include/linux/bpf_oom.h
> >> >> @@ -0,0 +1,49 @@
> >> >> +/* SPDX-License-Identifier: GPL-2.0+ */
> >> >> +
> >> >> +#ifndef __BPF_OOM_H
> >> >> +#define __BPF_OOM_H
> >> >> +
> >> >> +struct bpf_oom;
> >> >> +struct oom_control;
> >> >> +
> >> >> +#define BPF_OOM_NAME_MAX_LEN 64
> >> >> +
> >> >> +struct bpf_oom_ops {
> >> >> +       /**
> >> >> +        * @handle_out_of_memory: Out of memory bpf handler, called before
> >> >> +        * the in-kernel OOM killer.
> >> >> +        * @oc: OOM control structure
> >> >> +        *
> >> >> +        * Should return 1 if some memory was freed up, otherwise
> >> >> +        * the in-kernel OOM killer is invoked.
> >> >> +        */
> >> >> +       int (*handle_out_of_memory)(struct oom_control *oc);
> >> >> +
> >> >> +       /**
> >> >> +        * @name: BPF OOM policy name
> >> >> +        */
> >> >> +       char name[BPF_OOM_NAME_MAX_LEN];
> >> >
> >> > Why should the name be a part of ops structure? IMO it's not an
> >> > attribute of the operations but rather of the oom handler which is
> >> > represented by bpf_oom here.
> >>
> >> The ops structure describes a user-defined oom policy. Currently
> >> it's just one handler and the policy name. Later additional handlers
> >> can be added, e.g. a handler to control the dmesg output.
> >>
> >> bpf_oom is an implementation detail: it's basically an extension
> >> to struct bpf_oom_ops which contains "private" fields required
> >> for the internal machinery.
> >
> > Ok. I hope we can come up with some more descriptive naming but I
> > can't think of something good ATM.
> >
> >>
> >> >
> >> >> +
> >> >> +       /* Private */
> >> >> +       struct bpf_oom *bpf_oom;
> >> >> +};
> >> >> +
> >> >> +#ifdef CONFIG_BPF_SYSCALL
> >> >> +/**
> >> >> + * @bpf_handle_oom: handle out of memory using bpf programs
> >> >> + * @oc: OOM control structure
> >> >> + *
> >> >> + * Returns true if a bpf oom program was executed, returned 1
> >> >> + * and some memory was actually freed.
> >> >
> >> > The above comment is unclear, please clarify.
> >>
> >> Fixed, thanks.
> >>
> >> /**
> >>  * @bpf_handle_oom: handle out of memory condition using bpf
> >>  * @oc: OOM control structure
> >>  *
> >>  * Returns true if some memory was freed.
> >>  */
> >> bool bpf_handle_oom(struct oom_control *oc);
> >>
> >>
> >> >
> >> >> + */
> >> >> +bool bpf_handle_oom(struct oom_control *oc);
> >> >> +
> >> >> +#else /* CONFIG_BPF_SYSCALL */
> >> >> +static inline bool bpf_handle_oom(struct oom_control *oc)
> >> >> +{
> >> >> +       return false;
> >> >> +}
> >> >> +
> >> >> +#endif /* CONFIG_BPF_SYSCALL */
> >> >> +
> >> >> +#endif /* __BPF_OOM_H */
> >> >> diff --git a/include/linux/oom.h b/include/linux/oom.h
> >> >> index 1e0fc6931ce9..ef453309b7ea 100644
> >> >> --- a/include/linux/oom.h
> >> >> +++ b/include/linux/oom.h
> >> >> @@ -51,6 +51,14 @@ struct oom_control {
> >> >>
> >> >>         /* Used to print the constraint info. */
> >> >>         enum oom_constraint constraint;
> >> >> +
> >> >> +#ifdef CONFIG_BPF_SYSCALL
> >> >> +       /* Used by the bpf oom implementation to mark the forward progress */
> >> >> +       bool bpf_memory_freed;
> >> >> +
> >> >> +       /* Policy name */
> >> >> +       const char *bpf_policy_name;
> >> >> +#endif
> >> >>  };
> >> >>
> >> >>  extern struct mutex oom_lock;
> >> >> diff --git a/mm/Makefile b/mm/Makefile
> >> >> index 1a7a11d4933d..a714aba03759 100644
> >> >> --- a/mm/Makefile
> >> >> +++ b/mm/Makefile
> >> >> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> >> >>  ifdef CONFIG_SWAP
> >> >>  obj-$(CONFIG_MEMCG) += swap_cgroup.o
> >> >>  endif
> >> >> +ifdef CONFIG_BPF_SYSCALL
> >> >> +obj-y += bpf_oom.o
> >> >> +endif
> >> >>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> >> >>  obj-$(CONFIG_GUP_TEST) += gup_test.o
> >> >>  obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
> >> >> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
> >> >> new file mode 100644
> >> >> index 000000000000..47633046819c
> >> >> --- /dev/null
> >> >> +++ b/mm/bpf_oom.c
> >> >> @@ -0,0 +1,157 @@
> >> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> >> +/*
> >> >> + * BPF-driven OOM killer customization
> >> >> + *
> >> >> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> >> >> + */
> >> >> +
> >> >> +#include <linux/bpf.h>
> >> >> +#include <linux/oom.h>
> >> >> +#include <linux/bpf_oom.h>
> >> >> +#include <linux/srcu.h>
> >> >> +
> >> >> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
> >> >> +static DEFINE_SPINLOCK(bpf_oom_lock);
> >> >> +static LIST_HEAD(bpf_oom_handlers);
> >> >> +
> >> >> +struct bpf_oom {
> >> >
> >> > Perhaps bpf_oom_handler ? Then bpf_oom_ops->bpf_oom could be called
> >> > bpf_oom_ops->handler.
> >>
> >> I don't think it's a handler, it's more like a private part
> >> of bpf_oom_ops. Maybe bpf_oom_impl? Idk
> >
> > Yeah, we need to come up with some nomenclature and name these structs
> > accordingly. In my mind ops means a structure that contains only
> > operations, so current naming does not sit well but maybe that's just
> > me...
> >
> >>
> >> >
> >> >
> >> >> +       struct bpf_oom_ops *ops;
> >> >> +       struct list_head node;
> >> >> +       struct srcu_struct srcu;
> >> >> +};
> >> >> +
> >> >> +bool bpf_handle_oom(struct oom_control *oc)
> >> >> +{
> >> >> +       struct bpf_oom_ops *ops;
> >> >> +       struct bpf_oom *bpf_oom;
> >> >> +       int list_idx, idx, ret = 0;
> >> >> +
> >> >> +       oc->bpf_memory_freed = false;
> >> >> +
> >> >> +       list_idx = srcu_read_lock(&bpf_oom_srcu);
> >> >> +       list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
> >> >> +               ops = READ_ONCE(bpf_oom->ops);
> >> >> +               if (!ops || !ops->handle_out_of_memory)
> >> >> +                       continue;
> >> >> +               idx = srcu_read_lock(&bpf_oom->srcu);
> >> >> +               oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
> >> >> +                       "bpf_defined_policy";
> >> >> +               ret = ops->handle_out_of_memory(oc);
> >> >> +               oc->bpf_policy_name = NULL;
> >> >> +               srcu_read_unlock(&bpf_oom->srcu, idx);
> >> >> +
> >> >> +               if (ret && oc->bpf_memory_freed)
> >> >
> >> > IIUC ret and oc->bpf_memory_freed seem to reflect the same state:
> >> > handler successfully freed some memory. Could you please clarify when
> >> > they differ?
> >>
> >> The idea here is to provide an additional safety measure:
> >> if the bpf program simply returns 1 without doing anything,
> >> the system won't deadlock.
> >>
> >> oc->bpf_memory_freed is set by the bpf_oom_kill_process() helper
> >> (and potentially some other helpers in the future, e.g.
> >> bpf_oom_rm_tmpfs_file()) and can't be modified by the bpf
> >> program directly.
> >
> > I see. Then maybe we use only oc->bpf_memory_freed and
> > handle_out_of_memory() does not return anything?
>
> Idk, I think it's neat to have the ability to pass control to the in-kernel
> OOM killer even after killing a task.
> Also, I believe bpf programs have to return an int anyway,
> so we can ignore it, but I don't necessarily see the point.

Ok, so you want this return value so that a bpf oom-handler can say "even
if something got killed, proceed with the next oom-handler anyway"? I
can't think of a case when someone would do that... In any case, the
use of this return value should be clearly documented.
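A sketch of what that documentation could say (my wording, distilled from
the discussion above, not text from the patchset):

/*
 * handle_out_of_memory() return value:
 *   1 - the handler considers this OOM event handled; if a kfunc such as
 *       bpf_oom_kill_process() has also set oc->bpf_memory_freed, no
 *       further bpf handlers and no in-kernel OOM killer are invoked.
 *   0 - keep going: run the next bpf handler and, if none succeeds,
 *       the in-kernel OOM killer, even if this handler already killed
 *       or freed something.
 */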

>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 10/14] bpf: selftests: bpf OOM handler test
  2025-08-18 17:01 ` [PATCH v1 10/14] bpf: selftests: bpf OOM handler test Roman Gushchin
  2025-08-20  9:33   ` Kumar Kartikeya Dwivedi
@ 2025-08-20 20:23   ` Andrii Nakryiko
  2025-08-21  0:10     ` Roman Gushchin
  1 sibling, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2025-08-20 20:23 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Mon, Aug 18, 2025 at 10:05 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Implement a pseudo-realistic test for the OOM handling
> functionality.
>
> The OOM handling policy which is implemented in bpf is to
> kill all tasks belonging to the biggest leaf cgroup, which
> doesn't contain unkillable tasks (tasks with oom_score_adj
> set to -1000). Pagecache size is excluded from the accounting.
>
> The test creates a hierarchy of memory cgroups, causes an
> OOM at the top level, checks that the expected process will be
> killed and checks memcg's oom statistics.
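Working through the numbers in the test below: cg2 (40MB) is excluded
because its task has oom_score_adj set to -1000, and cg3 is not a leaf, so
the eligible leaf cgroups are cg1 (10MB), cg3/cg4 (30MB) and cg3/cg5 (20MB).
The biggest of them is cg3/cg4, which is why it is the one marked
.victim = true, and the combined ~100MB of allocations against the 80MB
top-level limit is what triggers the OOM.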
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  .../selftests/bpf/prog_tests/test_oom.c       | 229 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/test_oom.c  | 108 +++++++++
>  2 files changed, 337 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom.c b/tools/testing/selftests/bpf/prog_tests/test_oom.c
> new file mode 100644
> index 000000000000..eaeb14a9d18f
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom.c
> @@ -0,0 +1,229 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include <test_progs.h>
> +#include <bpf/btf.h>
> +#include <bpf/bpf.h>
> +
> +#include "cgroup_helpers.h"
> +#include "test_oom.skel.h"
> +
> +struct cgroup_desc {
> +       const char *path;
> +       int fd;
> +       unsigned long long id;
> +       int pid;
> +       size_t target;
> +       size_t max;
> +       int oom_score_adj;
> +       bool victim;
> +};
> +
> +#define MB (1024 * 1024)
> +#define OOM_SCORE_ADJ_MIN      (-1000)
> +#define OOM_SCORE_ADJ_MAX      1000
> +
> +static struct cgroup_desc cgroups[] = {
> +       { .path = "/oom_test", .max = 80 * MB},
> +       { .path = "/oom_test/cg1", .target = 10 * MB,
> +         .oom_score_adj = OOM_SCORE_ADJ_MAX },
> +       { .path = "/oom_test/cg2", .target = 40 * MB,
> +         .oom_score_adj = OOM_SCORE_ADJ_MIN },
> +       { .path = "/oom_test/cg3" },
> +       { .path = "/oom_test/cg3/cg4", .target = 30 * MB,
> +         .victim = true },
> +       { .path = "/oom_test/cg3/cg5", .target = 20 * MB },
> +};
> +
> +static int spawn_task(struct cgroup_desc *desc)
> +{
> +       char *ptr;
> +       int pid;
> +
> +       pid = fork();
> +       if (pid < 0)
> +               return pid;
> +
> +       if (pid > 0) {
> +               /* parent */
> +               desc->pid = pid;
> +               return 0;
> +       }
> +
> +       /* child */
> +       if (desc->oom_score_adj) {
> +               char buf[64];
> +               int fd = open("/proc/self/oom_score_adj", O_WRONLY);
> +
> +               if (fd < 0)
> +                       return -1;
> +
> +               snprintf(buf, sizeof(buf), "%d", desc->oom_score_adj);
> +               write(fd, buf, sizeof(buf));
> +               close(fd);
> +       }
> +
> +       ptr = (char *)malloc(desc->target);
> +       if (!ptr)
> +               return -ENOMEM;
> +
> +       memset(ptr, 'a', desc->target);
> +
> +       while (1)
> +               sleep(1000);
> +
> +       return 0;
> +}
> +
> +static void setup_environment(void)
> +{
> +       int i, err;
> +
> +       err = setup_cgroup_environment();
> +       if (!ASSERT_OK(err, "setup_cgroup_environment"))
> +               goto cleanup;
> +
> +       for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
> +               cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
> +               if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
> +                       goto cleanup;
> +
> +               cgroups[i].id = get_cgroup_id(cgroups[i].path);
> +               if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
> +                       goto cleanup;
> +
> +               /* Freeze the top-level cgroup */
> +               if (i == 0) {
> +                       /* Freeze the top-level cgroup */
> +                       err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
> +                       if (!ASSERT_OK(err, "freeze cgroup"))
> +                               goto cleanup;
> +               }
> +
> +               /* Recursively enable the memory controller */
> +               if (!cgroups[i].target) {
> +
> +                       err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
> +                                               "+memory");
> +                       if (!ASSERT_OK(err, "enable memory controller"))
> +                               goto cleanup;
> +               }
> +
> +               /* Set memory.max */
> +               if (cgroups[i].max) {
> +                       char buf[256];
> +
> +                       snprintf(buf, sizeof(buf), "%lu", cgroups[i].max);
> +                       err = write_cgroup_file(cgroups[i].path, "memory.max", buf);
> +                       if (!ASSERT_OK(err, "set memory.max"))
> +                               goto cleanup;
> +
> +                       snprintf(buf, sizeof(buf), "0");
> +                       write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
> +
> +               }
> +
> +               /* Spawn tasks creating memory pressure */
> +               if (cgroups[i].target) {
> +                       char buf[256];
> +
> +                       err = spawn_task(&cgroups[i]);
> +                       if (!ASSERT_OK(err, "spawn task"))
> +                               goto cleanup;
> +
> +                       snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
> +                       err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
> +                       if (!ASSERT_OK(err, "put child into a cgroup"))
> +                               goto cleanup;
> +               }
> +       }
> +
> +       return;
> +
> +cleanup:
> +       cleanup_cgroup_environment();
> +}
> +
> +static int run_and_wait_for_oom(void)
> +{
> +       int ret = -1;
> +       bool first = true;
> +       char buf[4096] = {};
> +       size_t size;
> +
> +       /* Unfreeze the top-level cgroup */
> +       ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
> +       if (!ASSERT_OK(ret, "freeze cgroup"))
> +               return -1;
> +
> +       for (;;) {
> +               int i, status;
> +               pid_t pid = wait(&status);
> +
> +               if (pid == -1) {
> +                       if (errno == EINTR)
> +                               continue;
> +                       /* ECHILD */
> +                       break;
> +               }
> +
> +               if (!first)
> +                       continue;
> +
> +               first = false;
> +
> +               /* Check which process was terminated first */
> +               for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
> +                       if (!ASSERT_OK(cgroups[i].victim !=
> +                                      (pid == cgroups[i].pid),
> +                                      "correct process was killed")) {
> +                               ret = -1;
> +                               break;
> +                       }
> +
> +                       if (!cgroups[i].victim)
> +                               continue;
> +
> +                       /* Check the memcg oom counter */
> +                       size = read_cgroup_file(cgroups[i].path,
> +                                               "memory.events",
> +                                               buf, sizeof(buf));
> +                       if (!ASSERT_OK(size <= 0, "read memory.events")) {
> +                               ret = -1;
> +                               break;
> +                       }
> +
> +                       if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
> +                                      "oom_kill count check")) {
> +                               ret = -1;
> +                               break;
> +                       }
> +               }
> +
> +               /* Kill all remaining tasks */
> +               for (i = 0; i < ARRAY_SIZE(cgroups); i++)
> +                       if (cgroups[i].pid && cgroups[i].pid != pid)
> +                               kill(cgroups[i].pid, SIGKILL);
> +       }
> +
> +       return ret;
> +}
> +
> +void test_oom(void)
> +{
> +       struct test_oom *skel;
> +       int err;
> +
> +       setup_environment();
> +
> +       skel = test_oom__open_and_load();
> +       err = test_oom__attach(skel);
> +       if (CHECK_FAIL(err))
> +               goto cleanup;
> +
> +       /* Unfreeze all child tasks and create the memory pressure */
> +       err = run_and_wait_for_oom();
> +       CHECK_FAIL(err);
> +
> +cleanup:
> +       cleanup_cgroup_environment();
> +       test_oom__destroy(skel);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_oom.c b/tools/testing/selftests/bpf/progs/test_oom.c
> new file mode 100644
> index 000000000000..ca83563fc9a8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_oom.c
> @@ -0,0 +1,108 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define OOM_SCORE_ADJ_MIN      (-1000)
> +
> +void bpf_rcu_read_lock(void) __ksym;
> +void bpf_rcu_read_unlock(void) __ksym;
> +struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
> +void bpf_task_release(struct task_struct *p) __ksym;
> +struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym;
> +struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
> +void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;
> +int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *task,
> +                        const char *message__str) __ksym;

These declarations should come from vmlinux.h; if you don't get them,
you might not have a recent enough pahole.

At the very least these should all be __ksym __weak, not just __ksym
(though I'd rather we not add them manually at all).
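For example, if the declarations are kept at all, that would look like:

struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym __weak;
struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym __weak;
void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym __weak;
int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *task,
			 const char *message__str) __ksym __weak;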

[...]


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc
  2025-08-18 17:01 ` [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc Roman Gushchin
@ 2025-08-20 20:30   ` Andrii Nakryiko
  2025-08-21  0:36     ` Roman Gushchin
  0 siblings, 1 reply; 67+ messages in thread
From: Andrii Nakryiko @ 2025-08-20 20:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Mon, Aug 18, 2025 at 10:06 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Implement a new bpf_psi_create_trigger() bpf kfunc, which allows
> creating new psi triggers and attaching them either to cgroups or
> system-wide.
>
> Created triggers will exist until the struct ops is unloaded and,
> if they are attached to a cgroup, until the cgroup is deleted.
>
> Due to a limitation of 5 arguments, the resource type and the "full"
> bit are squeezed into a single u32.
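A rough usage sketch: the kfunc signature and the PSI_MEM / BPF_PSI_FULL
encoding are from the patch, while the struct ops callback that receives
the struct bpf_psi pointer (and its section name) is only assumed here.

int bpf_psi_create_trigger(struct bpf_psi *bpf_psi, u64 cgroup_id,
			   u32 resource, u32 threshold_us,
			   u32 window_us) __ksym;

SEC("struct_ops.s/init")	/* hypothetical callback name */
int BPF_PROG(setup_psi_triggers, struct bpf_psi *psi)
{
	/* system-wide trigger: 100ms of full memory stall per 1s window */
	return bpf_psi_create_trigger(psi, 0 /* cgroup_id */,
				      PSI_MEM | BPF_PSI_FULL,
				      100 * 1000, 1000 * 1000);
}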
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  kernel/sched/bpf_psi.c | 84 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 84 insertions(+)
>
> diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
> index 2ea9d7276b21..94b684221708 100644
> --- a/kernel/sched/bpf_psi.c
> +++ b/kernel/sched/bpf_psi.c
> @@ -156,6 +156,83 @@ static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
>         .is_valid_access = bpf_psi_ops_is_valid_access,
>  };
>
> +__bpf_kfunc_start_defs();
> +
> +/**
> + * bpf_psi_create_trigger - Create a PSI trigger
> + * @bpf_psi: bpf_psi struct to attach the trigger to
> + * @cgroup_id: cgroup Id to attach the trigger; 0 for system-wide scope
> + * @resource: resource to monitor (PSI_MEM, PSI_IO, etc) and the full bit.
> + * @threshold_us: threshold in us
> + * @window_us: window in us
> + *
> + * Creates a PSI trigger and attaches it to bpf_psi. The trigger will
> + * remain active until the bpf struct ops is unloaded or the corresponding
> + * cgroup is deleted.
> + *
> + * Resource's most significant bit encodes whether "some" or "full"
> + * PSI state should be tracked.
> + *
> + * Returns 0 on success and the error code on failure.
> + */
> +__bpf_kfunc int bpf_psi_create_trigger(struct bpf_psi *bpf_psi,
> +                                      u64 cgroup_id, u32 resource,
> +                                      u32 threshold_us, u32 window_us)
> +{
> +       enum psi_res res = resource & ~BPF_PSI_FULL;
> +       bool full = resource & BPF_PSI_FULL;
> +       struct psi_trigger_params params;
> +       struct cgroup *cgroup __maybe_unused = NULL;
> +       struct psi_group *group;
> +       struct psi_trigger *t;
> +       int ret = 0;
> +
> +       if (res >= NR_PSI_RESOURCES)
> +               return -EINVAL;
> +
> +#ifdef CONFIG_CGROUPS
> +       if (cgroup_id) {
> +               cgroup = cgroup_get_from_id(cgroup_id);
> +               if (IS_ERR_OR_NULL(cgroup))
> +                       return PTR_ERR(cgroup);
> +
> +               group = cgroup_psi(cgroup);
> +       } else
> +#endif
> +               group = &psi_system;

just a drive-by comment while skimming through the patch set: can't
you use IS_ENABLED(CONFIG_CGROUPS) and have a proper if/else with
proper {} ?
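Something along these lines, assuming the cgroup helpers remain visible to
the compiler in !CONFIG_CGROUPS builds so dead code elimination can drop
the branch:

	if (IS_ENABLED(CONFIG_CGROUPS) && cgroup_id) {
		cgroup = cgroup_get_from_id(cgroup_id);
		if (IS_ERR_OR_NULL(cgroup))
			return PTR_ERR(cgroup);

		group = cgroup_psi(cgroup);
	} else {
		group = &psi_system;
	}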

> +
> +       params.type = PSI_BPF;
> +       params.bpf_psi = bpf_psi;
> +       params.privileged = capable(CAP_SYS_RESOURCE);
> +       params.res = res;
> +       params.full = full;
> +       params.threshold_us = threshold_us;
> +       params.window_us = window_us;
> +
> +       t = psi_trigger_create(group, &params);
> +       if (IS_ERR(t))
> +               ret = PTR_ERR(t);
> +       else
> +               t->cgroup_id = cgroup_id;
> +
> +#ifdef CONFIG_CGROUPS
> +       if (cgroup)
> +               cgroup_put(cgroup);
> +#endif
> +
> +       return ret;
> +}
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(bpf_psi_kfuncs)
> +BTF_ID_FLAGS(func, bpf_psi_create_trigger, KF_TRUSTED_ARGS)
> +BTF_KFUNCS_END(bpf_psi_kfuncs)
> +
> +static const struct btf_kfunc_id_set bpf_psi_kfunc_set = {
> +       .owner          = THIS_MODULE,
> +       .set            = &bpf_psi_kfuncs,
> +};
> +
>  static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
>  {
>         struct bpf_psi_ops *ops = kdata;
> @@ -238,6 +315,13 @@ static int __init bpf_psi_struct_ops_init(void)
>         if (!bpf_psi_wq)
>                 return -ENOMEM;
>
> +       err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +                                       &bpf_psi_kfunc_set);

would this make the kfunc callable from any struct_ops, not just this psi one?

> +       if (err) {
> +               pr_warn("error while registering bpf psi kfuncs: %d", err);
> +               goto err;
> +       }
> +
>         err = register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops);
>         if (err) {
>                 pr_warn("error while registering bpf psi struct ops: %d", err);
> --
> 2.50.1
>
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 00/14] mm: BPF OOM
  2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
                   ` (14 preceding siblings ...)
  2025-08-19  4:08 ` [PATCH v1 00/14] mm: BPF OOM Suren Baghdasaryan
@ 2025-08-20 21:06 ` Shakeel Butt
  2025-08-21  0:01   ` Roman Gushchin
  15 siblings, 1 reply; 67+ messages in thread
From: Shakeel Butt @ 2025-08-20 21:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Mon, Aug 18, 2025 at 10:01:22AM -0700, Roman Gushchin wrote:
> This patchset adds an ability to customize the out of memory
> handling using bpf.
> 
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
> 
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
> 
> It provides a generic interface which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).

The releasing memory part is really interesting and useful. I can see
much more reliable and targeted oom reaping with this approach.

> 
> The past attempt to implement memory-cgroup aware policy [2] showed
> that there are multiple opinions on what the best policy is.  As it's
> highly workload-dependent and specific to a concrete way of organizing
> workloads, the structure of the cgroup tree etc,

and user space policies: Google, for example, has very clear priorities
among concurrently running workloads while many other users do not.

> a customizable
> bpf-based implementation is preferable over a in-kernel implementation
> with a dozen on sysctls.

+1

> 
> The second part is related to the fundamental question on when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite trashing and effective soft lockups.  In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]

and Android's LMKD (https://source.android.com/docs/core/perf/lmkd) uses
PSI too.

> ). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls. In this scenario the userspace daemon was supposed to handle
> the majority of OOMs, while the in-kernel OOM killer worked as the
> last resort measure to guarantee that the system would never deadlock
> on the memory. But this approach creates additional infrastructure
> churn: userspace OOM daemon is a separate entity which needs to be
> deployed, updated, monitored. A completely different pipeline needs to
> be built to monitor both types of OOM events and collect associated
> logs. A userspace daemon is more restricted in terms on what data is
> available to it. Implementing a daemon which can work reliably under a
> heavy memory pressure in the system is also tricky.

Thanks for raising this; it is really challenging on heavily
overcommitted systems. The userspace oom-killer needs cpu (or scheduling)
and memory guarantees, as it needs to run and collect stats to decide whom
to kill. Even with that, it can still get stuck on some global kernel
locks (I remember seeing Google's userspace oom-killer, a thread in
borglet, stuck on the cgroup mutex or a kernfs lock or something).
Anyway, I see a lot of potential in this BPF-based oom-killer.

Orthogonally I am wondering if we can enable actions other than killing.
For example some workloads might prefer to get frozen or migrated away
instead of being killed.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 02/14] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  2025-08-20  9:17   ` Kumar Kartikeya Dwivedi
@ 2025-08-20 22:32     ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-20 22:32 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 18 Aug 2025 at 19:01, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Struct oom_control is used to describe the OOM context.
>> Its memcg field defines the scope of the OOM: it's NULL for global
>> OOMs and a valid memcg pointer for memcg-scoped OOMs.
>> Teach the bpf verifier to recognize it as a trusted or NULL pointer.
>> It will provide the bpf OOM handler a trusted memcg pointer,
>> which for example is required for iterating the memcg's subtree.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>
> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Thanks!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers
  2025-08-20  9:21   ` Kumar Kartikeya Dwivedi
@ 2025-08-20 22:43     ` Roman Gushchin
  2025-08-20 23:33       ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-20 22:43 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> To effectively operate with memory cgroups in bpf there is a need
>> to convert css pointers to memcg pointers. A simple container_of
>> cast which is used in the kernel code can't be used in bpf because
>> from the verifier's point of view that's a out-of-bounds memory access.
>>
>> Introduce helper get/put kfuncs which can be used to get
>> a refcounted memcg pointer from the css pointer:
>>   - bpf_get_mem_cgroup,
>>   - bpf_put_mem_cgroup.
>>
>> bpf_get_mem_cgroup() can take both memcg's css and the corresponding
>> cgroup's "self" css. It allows it to be used with the existing cgroup
>> iterator which iterates over cgroup tree, not memcg tree.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  include/linux/memcontrol.h |   2 +
>>  mm/Makefile                |   1 +
>>  mm/bpf_memcontrol.c        | 151 +++++++++++++++++++++++++++++++++++++
>>  3 files changed, 154 insertions(+)
>>  create mode 100644 mm/bpf_memcontrol.c
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 87b6688f124a..785a064000cd 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -932,6 +932,8 @@ static inline void mod_memcg_page_state(struct page *page,
>>         rcu_read_unlock();
>>  }
>>
>> +unsigned long memcg_events(struct mem_cgroup *memcg, int event);
>> +unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
>>  unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
>>  unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
>>  unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>> diff --git a/mm/Makefile b/mm/Makefile
>> index a714aba03759..c397af904a87 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
>>  endif
>>  ifdef CONFIG_BPF_SYSCALL
>>  obj-y += bpf_oom.o
>> +obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
>>  endif
>>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>>  obj-$(CONFIG_GUP_TEST) += gup_test.o
>> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
>> new file mode 100644
>> index 000000000000..66f2a359af7e
>> --- /dev/null
>> +++ b/mm/bpf_memcontrol.c
>> @@ -0,0 +1,151 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Memory Controller-related BPF kfuncs and auxiliary code
>> + *
>> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
>> + */
>> +
>> +#include <linux/memcontrol.h>
>> +#include <linux/bpf.h>
>> +
>> +__bpf_kfunc_start_defs();
>> +
>> +/**
>> + * bpf_get_mem_cgroup - Get a reference to a memory cgroup
>> + * @css: pointer to the css structure
>> + *
>> + * Returns a pointer to a mem_cgroup structure after bumping
>> + * the corresponding css's reference counter.
>> + *
>> + * It's fine to pass a css which belongs to any cgroup controller,
>> + * e.g. unified hierarchy's main css.
>> + *
>> + * Implements KF_ACQUIRE semantics.
>> + */
>> +__bpf_kfunc struct mem_cgroup *
>> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
>> +{
>> +       struct mem_cgroup *memcg = NULL;
>> +       bool rcu_unlock = false;
>> +
>> +       if (!root_mem_cgroup)
>> +               return NULL;
>> +
>> +       if (root_mem_cgroup->css.ss != css->ss) {
>> +               struct cgroup *cgroup = css->cgroup;
>> +               int ssid = root_mem_cgroup->css.ss->id;
>> +
>> +               rcu_read_lock();
>> +               rcu_unlock = true;
>> +               css = rcu_dereference_raw(cgroup->subsys[ssid]);
>> +       }
>> +
>> +       if (css && css_tryget(css))
>> +               memcg = container_of(css, struct mem_cgroup, css);
>> +
>> +       if (rcu_unlock)
>> +               rcu_read_unlock();
>> +
>> +       return memcg;
>> +}
>> +
>> +/**
>> + * bpf_put_mem_cgroup - Put a reference to a memory cgroup
>> + * @memcg: memory cgroup to release
>> + *
>> + * Releases a previously acquired memcg reference.
>> + * Implements KF_RELEASE semantics.
>> + */
>> +__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
>> +{
>> +       css_put(&memcg->css);
>> +}
>> +
>> +/**
>> + * bpf_mem_cgroup_events - Read memory cgroup's event counter
>> + * @memcg: memory cgroup
>> + * @event: event idx
>> + *
>> + * Allows to read memory cgroup event counters.
>> + */
>> +__bpf_kfunc unsigned long bpf_mem_cgroup_events(struct mem_cgroup *memcg, int event)
>> +{
>> +
>> +       if (event < 0 || event >= NR_VM_EVENT_ITEMS)
>> +               return (unsigned long)-1;
>> +
>> +       return memcg_events(memcg, event);
>> +}
>> +
>> +/**
>> + * bpf_mem_cgroup_usage - Read memory cgroup's usage
>> + * @memcg: memory cgroup
>> + *
>> + * Returns current memory cgroup size in bytes.
>> + */
>> +__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
>> +{
>> +       return page_counter_read(&memcg->memory);
>> +}
>> +
>> +/**
>> + * bpf_mem_cgroup_events - Read memory cgroup's page state counter
>> + * @memcg: memory cgroup
>> + * @event: event idx
>> + *
>> + * Allows to read memory cgroup statistics.
>> + */
>> +__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
>> +{
>> +       if (idx < 0 || idx >= MEMCG_NR_STAT)
>> +               return (unsigned long)-1;
>> +
>> +       return memcg_page_state(memcg, idx);
>> +}
>> +
>> +/**
>> + * bpf_mem_cgroup_flush_stats - Flush memory cgroup's statistics
>> + * @memcg: memory cgroup
>> + *
>> + * Propagate memory cgroup's statistics up the cgroup tree.
>> + *
>> + * Note, that this function uses the rate-limited version of
>> + * mem_cgroup_flush_stats() to avoid hurting the system-wide
>> + * performance. So bpf_mem_cgroup_flush_stats() guarantees only
>> + * that statistics is not stale beyond 2*FLUSH_TIME.
>> + */
>> +__bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>> +{
>> +       mem_cgroup_flush_stats_ratelimited(memcg);
>> +}
>> +
>> +__bpf_kfunc_end_defs();
>> +
>> +BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
>> +BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
>
> I think you could set KF_TRUSTED_ARGS for this as well.

Not really. The intended use case is to iterate over the cgroup tree,
which gives non-trusted css pointers:
	bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
		memcg = bpf_get_mem_cgroup(css_pos);
	}
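
For completeness, a fuller sketch of the same pattern (mirroring the
selftest in patch 10), so the acquire/put pairing is visible:

	bpf_rcu_read_lock();
	bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
		memcg = bpf_get_mem_cgroup(css_pos);
		if (!memcg)
			continue;

		/* inspect usage, remember a candidate victim, etc. */

		bpf_put_mem_cgroup(memcg);
	}
	bpf_rcu_read_unlock();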

Thanks


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 05/14] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  2025-08-20  9:25   ` Kumar Kartikeya Dwivedi
@ 2025-08-20 22:45     ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-20 22:45 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Introduce a bpf kfunc to get a trusted pointer to the root memory
>> cgroup. It's very handy to traverse the full memcg tree, e.g.
>> for handling a system-wide OOM.
>>
>> It's possible to obtain this pointer by traversing the memcg tree
>> up from any known memcg, but it's sub-optimal and makes bpf programs
>> more complex and less efficient.
>>
>> bpf_get_root_mem_cgroup() has a KF_ACQUIRE | KF_RET_NULL semantics,
>> however in reality it's not necessarily to bump the corresponding
>> reference counter - root memory cgroup is immortal, reference counting
>> is skipped, see css_get(). Once set, root_mem_cgroup is always a valid
>> memcg pointer. It's safe to call bpf_put_mem_cgroup() for the pointer
>> obtained with bpf_get_root_mem_cgroup(), it's effectively a no-op.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  mm/bpf_memcontrol.c | 15 +++++++++++++++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
>> index 66f2a359af7e..a8faa561bcba 100644
>> --- a/mm/bpf_memcontrol.c
>> +++ b/mm/bpf_memcontrol.c
>> @@ -10,6 +10,20 @@
>>
>>  __bpf_kfunc_start_defs();
>>
>> +/**
>> + * bpf_get_root_mem_cgroup - Returns a pointer to the root memory cgroup
>> + *
>> + * The function has KF_ACQUIRE semantics, even though the root memory
>> + * cgroup is never destroyed after being created and doesn't require
>> + * reference counting. And it's perfectly safe to pass it to
>> + * bpf_put_mem_cgroup()
>> + */
>> +__bpf_kfunc struct mem_cgroup *bpf_get_root_mem_cgroup(void)
>> +{
>> +       /* css_get() is not needed */
>> +       return root_mem_cgroup;
>> +}
>> +
>>  /**
>>   * bpf_get_mem_cgroup - Get a reference to a memory cgroup
>>   * @css: pointer to the css structure
>> @@ -122,6 +136,7 @@ __bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>>  __bpf_kfunc_end_defs();
>>
>>  BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
>> +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
>
> Same suggestion here (re: trusted args).

It's not really taking any arguments, so I don't think it's applicable:
	struct mem_cgroup *bpf_get_root_mem_cgroup(void)


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 10/14] bpf: selftests: bpf OOM handler test
  2025-08-20  9:33   ` Kumar Kartikeya Dwivedi
@ 2025-08-20 22:49     ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-20 22:49 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Implement a pseudo-realistic test for the OOM handling
>> functionality.
>>
>> The OOM handling policy which is implemented in bpf is to
>> kill all tasks belonging to the biggest leaf cgroup, which
>> doesn't contain unkillable tasks (tasks with oom_score_adj
>> set to -1000). Pagecache size is excluded from the accounting.
>>
>> The test creates a hierarchy of memory cgroups, causes an
>> OOM at the top level, checks that the expected process will be
>> killed and checks memcg's oom statistics.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  [...]
>> +
>> +/*
>> + * Find the largest leaf cgroup (ignoring page cache) without unkillable tasks
>> + * and kill all belonging tasks.
>> + */
>> +SEC("struct_ops.s/handle_out_of_memory")
>> +int BPF_PROG(test_out_of_memory, struct oom_control *oc)
>> +{
>> +       struct task_struct *task;
>> +       struct mem_cgroup *root_memcg = oc->memcg;
>> +       struct mem_cgroup *memcg, *victim = NULL;
>> +       struct cgroup_subsys_state *css_pos;
>> +       unsigned long usage, max_usage = 0;
>> +       unsigned long pagecache = 0;
>> +       int ret = 0;
>> +
>> +       if (root_memcg)
>> +               root_memcg = bpf_get_mem_cgroup(&root_memcg->css);
>> +       else
>> +               root_memcg = bpf_get_root_mem_cgroup();
>> +
>> +       if (!root_memcg)
>> +               return 0;
>> +
>> +       bpf_rcu_read_lock();
>> +       bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
>> +               if (css_pos->cgroup->nr_descendants + css_pos->cgroup->nr_dying_descendants)
>> +                       continue;
>> +
>> +               memcg = bpf_get_mem_cgroup(css_pos);
>> +               if (!memcg)
>> +                       continue;
>> +
>> +               usage = bpf_mem_cgroup_usage(memcg);
>> +               pagecache = bpf_mem_cgroup_page_state(memcg, NR_FILE_PAGES);
>> +
>> +               if (usage > pagecache)
>> +                       usage -= pagecache;
>> +               else
>> +                       usage = 0;
>> +
>> +               if ((usage > max_usage) && mem_cgroup_killable(memcg)) {
>> +                       max_usage = usage;
>> +                       if (victim)
>> +                               bpf_put_mem_cgroup(victim);
>> +                       victim = bpf_get_mem_cgroup(&memcg->css);
>> +               }
>> +
>> +               bpf_put_mem_cgroup(memcg);
>> +       }
>> +       bpf_rcu_read_unlock();
>> +
>> +       if (!victim)
>> +               goto exit;
>> +
>> +       bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
>> +               struct task_struct *t = bpf_task_acquire(task);
>> +
>> +               if (t) {
>> +                       if (!bpf_task_is_oom_victim(task))
>> +                               bpf_oom_kill_process(oc, task, "bpf oom test");
>
> Is there a scenario where we want to invoke bpf_oom_kill_process when
> the task is not an oom victim?

Not really, but...

> Would it be better to subsume this check in the kfunc itself?

bpf_task_is_oom_victim() is useful by itself, because if we see
a task which is about to be killed, we can likely simply bail out.
Let me adjust the test to reflect it.
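
Something along these lines, perhaps (a rough sketch of the adjusted
loop, omitting the acquire/release details):

	bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
		/* an OOM kill is already in flight, bail out */
		if (bpf_task_is_oom_victim(task))
			break;
		bpf_oom_kill_process(oc, task, "bpf oom test");
	}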


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 06/14] mm: introduce bpf_out_of_memory() bpf kfunc
  2025-08-20  9:34   ` Kumar Kartikeya Dwivedi
@ 2025-08-20 22:59     ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-20 22:59 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
>> an out of memory events and trigger the corresponding kernel OOM
>> handling mechanism.
>>
>> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
>> as an argument, as well as the page order.
>>
>> If the wait_on_oom_lock argument is not set, only one OOM can be
>> declared and handled in the system at once, so if the function is
>> called in parallel to another OOM handling, it bails out with -EBUSY.
>> This mode is suited for global OOM's: any concurrent OOMs will likely
>> do the job and release some memory. In a blocking mode (which is
>> suited for memcg OOMs) the execution will wait on the oom_lock mutex.
>>
>> The function is declared as sleepable. It guarantees that it won't
>> be called from an atomic context. It's required by the OOM handling
>> code, which is not guaranteed to work in a non-blocking context.
>>
>> Handling of a memcg OOM almost always requires taking of the
>> css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable
>> also guarantees that it can't be called with acquired css_set_lock,
>> so the kernel can't deadlock on it.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  mm/oom_kill.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 25fc5e744e27..df409f0fac45 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1324,10 +1324,55 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>>         return 0;
>>  }
>>
>> +/**
>> + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
>> + * @memcg__nullable: memcg or NULL for system-wide OOMs
>> + * @order: order of page which wasn't allocated
>> + * @wait_on_oom_lock: if true, block on oom_lock
>> + * @constraint_text__nullable: custom constraint description for the OOM report
>> + *
>> + * Declares the Out Of Memory state and invokes the OOM killer.
>> + *
>> + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
>> + * is true, the function will wait on it. Otherwise it bails out with -EBUSY
>> + * if oom_lock is contended.
>> + *
>> + * Generally it's advised to pass wait_on_oom_lock=true for global OOMs
>> + * and wait_on_oom_lock=false for memcg-scoped OOMs.
>> + *
>> + * Returns 1 if the forward progress was achieved and some memory was freed.
>> + * Returns a negative value if an error has been occurred.
>> + */
>> +__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>> +                                 int order, bool wait_on_oom_lock)
>
> I think this bool should be a u64 flags instead, just to make it
> easier to extend behavior in the future.

I like it, will change in the next version.
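
Roughly along these lines (the flag name below is only a placeholder to
illustrate the extensible interface, the actual naming is TBD):

	#define BPF_OOM_WAIT_ON_OOM_LOCK	(1ULL << 0)

	__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
					  int order, u64 flags)
	{
		bool wait_on_oom_lock = flags & BPF_OOM_WAIT_ON_OOM_LOCK;
		...
	}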

Thanks for the idea and also for reviewing the series!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers
  2025-08-20 22:43     ` Roman Gushchin
@ 2025-08-20 23:33       ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-20 23:33 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On Thu, 21 Aug 2025 at 00:43, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> > On Mon, 18 Aug 2025 at 19:02, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >>
> >> To effectively operate with memory cgroups in bpf there is a need
> >> to convert css pointers to memcg pointers. A simple container_of
> >> cast which is used in the kernel code can't be used in bpf because
> >> from the verifier's point of view that's a out-of-bounds memory access.
> >>
> >> Introduce helper get/put kfuncs which can be used to get
> >> a refcounted memcg pointer from the css pointer:
> >>   - bpf_get_mem_cgroup,
> >>   - bpf_put_mem_cgroup.
> >>
> >> bpf_get_mem_cgroup() can take both memcg's css and the corresponding
> >> cgroup's "self" css. It allows it to be used with the existing cgroup
> >> iterator which iterates over cgroup tree, not memcg tree.
> >>
> >> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> >> ---
> >>  include/linux/memcontrol.h |   2 +
> >>  mm/Makefile                |   1 +
> >>  mm/bpf_memcontrol.c        | 151 +++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 154 insertions(+)
> >>  create mode 100644 mm/bpf_memcontrol.c
> >>
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index 87b6688f124a..785a064000cd 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -932,6 +932,8 @@ static inline void mod_memcg_page_state(struct page *page,
> >>         rcu_read_unlock();
> >>  }
> >>
> >> +unsigned long memcg_events(struct mem_cgroup *memcg, int event);
> >> +unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
> >>  unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
> >>  unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
> >>  unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >> diff --git a/mm/Makefile b/mm/Makefile
> >> index a714aba03759..c397af904a87 100644
> >> --- a/mm/Makefile
> >> +++ b/mm/Makefile
> >> @@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
> >>  endif
> >>  ifdef CONFIG_BPF_SYSCALL
> >>  obj-y += bpf_oom.o
> >> +obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
> >>  endif
> >>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> >>  obj-$(CONFIG_GUP_TEST) += gup_test.o
> >> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> >> new file mode 100644
> >> index 000000000000..66f2a359af7e
> >> --- /dev/null
> >> +++ b/mm/bpf_memcontrol.c
> >> @@ -0,0 +1,151 @@
> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> +/*
> >> + * Memory Controller-related BPF kfuncs and auxiliary code
> >> + *
> >> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> >> + */
> >> +
> >> +#include <linux/memcontrol.h>
> >> +#include <linux/bpf.h>
> >> +
> >> +__bpf_kfunc_start_defs();
> >> +
> >> +/**
> >> + * bpf_get_mem_cgroup - Get a reference to a memory cgroup
> >> + * @css: pointer to the css structure
> >> + *
> >> + * Returns a pointer to a mem_cgroup structure after bumping
> >> + * the corresponding css's reference counter.
> >> + *
> >> + * It's fine to pass a css which belongs to any cgroup controller,
> >> + * e.g. unified hierarchy's main css.
> >> + *
> >> + * Implements KF_ACQUIRE semantics.
> >> + */
> >> +__bpf_kfunc struct mem_cgroup *
> >> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
> >> +{
> >> +       struct mem_cgroup *memcg = NULL;
> >> +       bool rcu_unlock = false;
> >> +
> >> +       if (!root_mem_cgroup)
> >> +               return NULL;
> >> +
> >> +       if (root_mem_cgroup->css.ss != css->ss) {
> >> +               struct cgroup *cgroup = css->cgroup;
> >> +               int ssid = root_mem_cgroup->css.ss->id;
> >> +
> >> +               rcu_read_lock();
> >> +               rcu_unlock = true;
> >> +               css = rcu_dereference_raw(cgroup->subsys[ssid]);
> >> +       }
> >> +
> >> +       if (css && css_tryget(css))
> >> +               memcg = container_of(css, struct mem_cgroup, css);
> >> +
> >> +       if (rcu_unlock)
> >> +               rcu_read_unlock();
> >> +
> >> +       return memcg;
> >> +}
> >> +
> >> +/**
> >> + * bpf_put_mem_cgroup - Put a reference to a memory cgroup
> >> + * @memcg: memory cgroup to release
> >> + *
> >> + * Releases a previously acquired memcg reference.
> >> + * Implements KF_RELEASE semantics.
> >> + */
> >> +__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> >> +{
> >> +       css_put(&memcg->css);
> >> +}
> >> +
> >> +/**
> >> + * bpf_mem_cgroup_events - Read memory cgroup's event counter
> >> + * @memcg: memory cgroup
> >> + * @event: event idx
> >> + *
> >> + * Allows to read memory cgroup event counters.
> >> + */
> >> +__bpf_kfunc unsigned long bpf_mem_cgroup_events(struct mem_cgroup *memcg, int event)
> >> +{
> >> +
> >> +       if (event < 0 || event >= NR_VM_EVENT_ITEMS)
> >> +               return (unsigned long)-1;
> >> +
> >> +       return memcg_events(memcg, event);
> >> +}
> >> +
> >> +/**
> >> + * bpf_mem_cgroup_usage - Read memory cgroup's usage
> >> + * @memcg: memory cgroup
> >> + *
> >> + * Returns current memory cgroup size in bytes.
> >> + */
> >> +__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
> >> +{
> >> +       return page_counter_read(&memcg->memory);
> >> +}
> >> +
> >> +/**
> >> + * bpf_mem_cgroup_events - Read memory cgroup's page state counter
> >> + * @memcg: memory cgroup
> >> + * @event: event idx
> >> + *
> >> + * Allows to read memory cgroup statistics.
> >> + */
> >> +__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
> >> +{
> >> +       if (idx < 0 || idx >= MEMCG_NR_STAT)
> >> +               return (unsigned long)-1;
> >> +
> >> +       return memcg_page_state(memcg, idx);
> >> +}
> >> +
> >> +/**
> >> + * bpf_mem_cgroup_flush_stats - Flush memory cgroup's statistics
> >> + * @memcg: memory cgroup
> >> + *
> >> + * Propagate memory cgroup's statistics up the cgroup tree.
> >> + *
> >> + * Note, that this function uses the rate-limited version of
> >> + * mem_cgroup_flush_stats() to avoid hurting the system-wide
> >> + * performance. So bpf_mem_cgroup_flush_stats() guarantees only
> >> + * that statistics is not stale beyond 2*FLUSH_TIME.
> >> + */
> >> +__bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> >> +{
> >> +       mem_cgroup_flush_stats_ratelimited(memcg);
> >> +}
> >> +
> >> +__bpf_kfunc_end_defs();
> >> +
> >> +BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
> >> +BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
> >
> > I think you could set KF_TRUSTED_ARGS for this as well.
>
> Not really. The intended use case is to iterate over the cgroup tree,
> which gives non-trusted css pointers:
>         bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
>                 memcg = bpf_get_mem_cgroup(css_pos);
>         }

Then I assume they're at least RCU protected? You could relax it from
trusted to KF_RCU (since I see css_tryget internally).
Otherwise the default behavior is unconstrained (any ptr matching that
type, even one obtained from random pointer walks, is accepted; that is
something to fix, but until then we have to actively mark kfuncs as
taking only safe arguments).
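
i.e. something like (sketch):

	BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RCU | KF_RET_NULL)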

>
> Thanks


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf
  2025-08-19 23:31       ` Roman Gushchin
@ 2025-08-20 23:56         ` Suren Baghdasaryan
  0 siblings, 0 replies; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-20 23:56 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Tue, Aug 19, 2025 at 4:31 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Roman Gushchin <roman.gushchin@linux.dev> writes:
>
> > Suren Baghdasaryan <surenb@google.com> writes:
> >
> >> On Mon, Aug 18, 2025 at 10:02 AM Roman Gushchin
> >> <roman.gushchin@linux.dev> wrote:
> >
> >>
> >>> +
> >>> +       /* Cgroup Id */
> >>> +       u64 cgroup_id;
> >>
> >> This cgroup_id field is weird. It's not initialized and not used here,
> >> then it gets initialized in the next patch and used in the last patch
> >> from a selftest. This is quite confusing. Also logically I don't think
> >> a cgroup attribute really belongs to psi_trigger... Can we at least
> >> move it into bpf_psi where it might fit a bit better?
> >
> > I can't move it to bpf_psi, because a single bpf_psi might own multiple
> > triggers with different cgroup_id's.
> > For sure I can move it to the next patch, if it's preferred.
> >
> > If you really don't like it here, other option is to replace it with
> > a new bpf helper (kfunc) which calculates the cgroup_id by walking the
> > trigger->group->cgroup->cgroup_id path each time.
>
> Actually there is no easy path from psi_group to cgroup, so there is
> no such option available, unfortunately. Or we need a back-link from
> the psi_group to cgroup.

Ok, I obviously missed some important relations between these
structures. Let me digest it some more before commenting further.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 00/14] mm: BPF OOM
  2025-08-20 21:06 ` Shakeel Butt
@ 2025-08-21  0:01   ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-21  0:01 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Shakeel Butt <shakeel.butt@linux.dev> writes:

> On Mon, Aug 18, 2025 at 10:01:22AM -0700, Roman Gushchin wrote:
>> This patchset adds an ability to customize the out of memory
>> handling using bpf.
>> 
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
>> 
>> The idea to use bpf for customizing the OOM handling is not new, but
>> unlike the previous proposal [1], which augmented the existing task
>> ranking policy, this one tries to be as generic as possible and
>> leverage the full power of the modern bpf.
>> 
>> It provides a generic interface which is called before the existing OOM
>> killer code and allows implementing any policy, e.g. picking a victim
>> task or memory cgroup or potentially even releasing memory in other
>> ways, e.g. deleting tmpfs files (the last one might require some
>> additional but relatively simple changes).
>
> The releasing memory part is really interesting and useful. I can see
> much more reliable and targeted oom reaping with this approach.
>
>> 
>> The past attempt to implement memory-cgroup aware policy [2] showed
>> that there are multiple opinions on what the best policy is.  As it's
>> highly workload-dependent and specific to a concrete way of organizing
>> workloads, the structure of the cgroup tree etc,
>
> and user space policies: e.g. Google has very clear priorities among
> concurrently running workloads, while many other users do not.
>
>> a customizable
>> bpf-based implementation is preferable over a in-kernel implementation
>> with a dozen on sysctls.
>
> +1
>
>> 
>> The second part is related to the fundamental question on when to
>> declare the OOM event. It's a trade-off between the risk of
>> unnecessary OOM kills and associated work losses and the risk of
>> infinite trashing and effective soft lockups.  In the last few years
>> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
>> systemd-OOMd [4]
>
> and Android's LMKD (https://source.android.com/docs/core/perf/lmkd) uses
> PSI too.
>
>> ). The common idea was to use userspace daemons to
>> implement custom OOM logic as well as rely on PSI monitoring to avoid
>> stalls. In this scenario the userspace daemon was supposed to handle
>> the majority of OOMs, while the in-kernel OOM killer worked as the
>> last resort measure to guarantee that the system would never deadlock
>> on the memory. But this approach creates additional infrastructure
>> churn: userspace OOM daemon is a separate entity which needs to be
>> deployed, updated, monitored. A completely different pipeline needs to
>> be built to monitor both types of OOM events and collect associated
>> logs. A userspace daemon is more restricted in terms on what data is
>> available to it. Implementing a daemon which can work reliably under a
>> heavy memory pressure in the system is also tricky.
>
> Thanks for raising this; it is really challenging on a very aggressively
> overcommitted system. The userspace oom-killer needs cpu (or scheduling)
> and memory guarantees, as it needs to run and collect stats to decide who
> to kill. Even with that, it can still get stuck on some global kernel
> locks (I remember seeing at Google their userspace oom-killer, a thread
> in the borglet, stuck on the cgroup mutex or a kernfs lock or
> something). Anyway, I see a lot of potential in this BPF-based
> oom-killer.
>
> Orthogonally I am wondering if we can enable actions other than killing.
> For example some workloads might prefer to get frozen or migrated away
> instead of being killed.

Absolutely, handling PSI events in the kernel (via BPF) opens a broad
range of possibilities, e.g. we can tune cgroup knobs, freeze/unfreeze
tasks, remove tmpfs files, promote/demote memory to other tiers, etc.
I was also thinking about tuning the readahead based on the memory
pressure.

Thanks!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 10/14] bpf: selftests: bpf OOM handler test
  2025-08-20 20:23   ` Andrii Nakryiko
@ 2025-08-21  0:10     ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-21  0:10 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Aug 18, 2025 at 10:05 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> Implement a pseudo-realistic test for the OOM handling
>> functionality.
>>
>> The OOM handling policy which is implemented in bpf is to
>> kill all tasks belonging to the biggest leaf cgroup, which
>> doesn't contain unkillable tasks (tasks with oom_score_adj
>> set to -1000). Pagecache size is excluded from the accounting.
>>
>> The test creates a hierarchy of memory cgroups, causes an
>> OOM at the top level, checks that the expected process will be
>> killed and checks memcg's oom statistics.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  .../selftests/bpf/prog_tests/test_oom.c       | 229 ++++++++++++++++++
>>  tools/testing/selftests/bpf/progs/test_oom.c  | 108 +++++++++
>>  2 files changed, 337 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
>>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom.c b/tools/testing/selftests/bpf/prog_tests/test_oom.c
>> new file mode 100644
>> index 000000000000..eaeb14a9d18f
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/test_oom.c
>> @@ -0,0 +1,229 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +#include <test_progs.h>
>> +#include <bpf/btf.h>
>> +#include <bpf/bpf.h>
>> +
>> +#include "cgroup_helpers.h"
>> +#include "test_oom.skel.h"
>> +
>> +struct cgroup_desc {
>> +       const char *path;
>> +       int fd;
>> +       unsigned long long id;
>> +       int pid;
>> +       size_t target;
>> +       size_t max;
>> +       int oom_score_adj;
>> +       bool victim;
>> +};
>> +
>> +#define MB (1024 * 1024)
>> +#define OOM_SCORE_ADJ_MIN      (-1000)
>> +#define OOM_SCORE_ADJ_MAX      1000
>> +
>> +static struct cgroup_desc cgroups[] = {
>> +       { .path = "/oom_test", .max = 80 * MB},
>> +       { .path = "/oom_test/cg1", .target = 10 * MB,
>> +         .oom_score_adj = OOM_SCORE_ADJ_MAX },
>> +       { .path = "/oom_test/cg2", .target = 40 * MB,
>> +         .oom_score_adj = OOM_SCORE_ADJ_MIN },
>> +       { .path = "/oom_test/cg3" },
>> +       { .path = "/oom_test/cg3/cg4", .target = 30 * MB,
>> +         .victim = true },
>> +       { .path = "/oom_test/cg3/cg5", .target = 20 * MB },
>> +};
>> +
>> +static int spawn_task(struct cgroup_desc *desc)
>> +{
>> +       char *ptr;
>> +       int pid;
>> +
>> +       pid = fork();
>> +       if (pid < 0)
>> +               return pid;
>> +
>> +       if (pid > 0) {
>> +               /* parent */
>> +               desc->pid = pid;
>> +               return 0;
>> +       }
>> +
>> +       /* child */
>> +       if (desc->oom_score_adj) {
>> +               char buf[64];
>> +               int fd = open("/proc/self/oom_score_adj", O_WRONLY);
>> +
>> +               if (fd < 0)
>> +                       return -1;
>> +
>> +               snprintf(buf, sizeof(buf), "%d", desc->oom_score_adj);
>> +               write(fd, buf, sizeof(buf));
>> +               close(fd);
>> +       }
>> +
>> +       ptr = (char *)malloc(desc->target);
>> +       if (!ptr)
>> +               return -ENOMEM;
>> +
>> +       memset(ptr, 'a', desc->target);
>> +
>> +       while (1)
>> +               sleep(1000);
>> +
>> +       return 0;
>> +}
>> +
>> +static void setup_environment(void)
>> +{
>> +       int i, err;
>> +
>> +       err = setup_cgroup_environment();
>> +       if (!ASSERT_OK(err, "setup_cgroup_environment"))
>> +               goto cleanup;
>> +
>> +       for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
>> +               cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
>> +               if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
>> +                       goto cleanup;
>> +
>> +               cgroups[i].id = get_cgroup_id(cgroups[i].path);
>> +               if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
>> +                       goto cleanup;
>> +
>> +               /* Freeze the top-level cgroup */
>> +               if (i == 0) {
>> +                       /* Freeze the top-level cgroup */
>> +                       err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
>> +                       if (!ASSERT_OK(err, "freeze cgroup"))
>> +                               goto cleanup;
>> +               }
>> +
>> +               /* Recursively enable the memory controller */
>> +               if (!cgroups[i].target) {
>> +
>> +                       err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
>> +                                               "+memory");
>> +                       if (!ASSERT_OK(err, "enable memory controller"))
>> +                               goto cleanup;
>> +               }
>> +
>> +               /* Set memory.max */
>> +               if (cgroups[i].max) {
>> +                       char buf[256];
>> +
>> +                       snprintf(buf, sizeof(buf), "%lu", cgroups[i].max);
>> +                       err = write_cgroup_file(cgroups[i].path, "memory.max", buf);
>> +                       if (!ASSERT_OK(err, "set memory.max"))
>> +                               goto cleanup;
>> +
>> +                       snprintf(buf, sizeof(buf), "0");
>> +                       write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
>> +
>> +               }
>> +
>> +               /* Spawn tasks creating memory pressure */
>> +               if (cgroups[i].target) {
>> +                       char buf[256];
>> +
>> +                       err = spawn_task(&cgroups[i]);
>> +                       if (!ASSERT_OK(err, "spawn task"))
>> +                               goto cleanup;
>> +
>> +                       snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
>> +                       err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
>> +                       if (!ASSERT_OK(err, "put child into a cgroup"))
>> +                               goto cleanup;
>> +               }
>> +       }
>> +
>> +       return;
>> +
>> +cleanup:
>> +       cleanup_cgroup_environment();
>> +}
>> +
>> +static int run_and_wait_for_oom(void)
>> +{
>> +       int ret = -1;
>> +       bool first = true;
>> +       char buf[4096] = {};
>> +       size_t size;
>> +
>> +       /* Unfreeze the top-level cgroup */
>> +       ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
>> +       if (!ASSERT_OK(ret, "freeze cgroup"))
>> +               return -1;
>> +
>> +       for (;;) {
>> +               int i, status;
>> +               pid_t pid = wait(&status);
>> +
>> +               if (pid == -1) {
>> +                       if (errno == EINTR)
>> +                               continue;
>> +                       /* ECHILD */
>> +                       break;
>> +               }
>> +
>> +               if (!first)
>> +                       continue;
>> +
>> +               first = false;
>> +
>> +               /* Check which process was terminated first */
>> +               for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
>> +                       if (!ASSERT_OK(cgroups[i].victim !=
>> +                                      (pid == cgroups[i].pid),
>> +                                      "correct process was killed")) {
>> +                               ret = -1;
>> +                               break;
>> +                       }
>> +
>> +                       if (!cgroups[i].victim)
>> +                               continue;
>> +
>> +                       /* Check the memcg oom counter */
>> +                       size = read_cgroup_file(cgroups[i].path,
>> +                                               "memory.events",
>> +                                               buf, sizeof(buf));
>> +                       if (!ASSERT_OK(size <= 0, "read memory.events")) {
>> +                               ret = -1;
>> +                               break;
>> +                       }
>> +
>> +                       if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
>> +                                      "oom_kill count check")) {
>> +                               ret = -1;
>> +                               break;
>> +                       }
>> +               }
>> +
>> +               /* Kill all remaining tasks */
>> +               for (i = 0; i < ARRAY_SIZE(cgroups); i++)
>> +                       if (cgroups[i].pid && cgroups[i].pid != pid)
>> +                               kill(cgroups[i].pid, SIGKILL);
>> +       }
>> +
>> +       return ret;
>> +}
>> +
>> +void test_oom(void)
>> +{
>> +       struct test_oom *skel;
>> +       int err;
>> +
>> +       setup_environment();
>> +
>> +       skel = test_oom__open_and_load();
>> +       err = test_oom__attach(skel);
>> +       if (CHECK_FAIL(err))
>> +               goto cleanup;
>> +
>> +       /* Unfreeze all child tasks and create the memory pressure */
>> +       err = run_and_wait_for_oom();
>> +       CHECK_FAIL(err);
>> +
>> +cleanup:
>> +       cleanup_cgroup_environment();
>> +       test_oom__destroy(skel);
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/test_oom.c b/tools/testing/selftests/bpf/progs/test_oom.c
>> new file mode 100644
>> index 000000000000..ca83563fc9a8
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/test_oom.c
>> @@ -0,0 +1,108 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +#include "vmlinux.h"
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_tracing.h>
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +#define OOM_SCORE_ADJ_MIN      (-1000)
>> +
>> +void bpf_rcu_read_lock(void) __ksym;
>> +void bpf_rcu_read_unlock(void) __ksym;
>> +struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
>> +void bpf_task_release(struct task_struct *p) __ksym;
>> +struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym;
>> +struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
>> +void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;
>> +int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *task,
>> +                        const char *message__str) __ksym;
>
> These declarations should come from vmlinux.h, if you don't get them,
> you might not have recent enough pahole.

Indeed. Fixed, thanks!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-20 11:28   ` Kumar Kartikeya Dwivedi
@ 2025-08-21  0:24     ` Roman Gushchin
  2025-08-21  0:36       ` Kumar Kartikeya Dwivedi
  2025-08-22 19:27       ` Martin KaFai Lau
  0 siblings, 2 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-21  0:24 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Mon, 18 Aug 2025 at 19:01, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Introduce a bpf struct ops for implementing custom OOM handling policies.
>>
>> The struct ops provides the bpf_handle_out_of_memory() callback,
>> which expected to return 1 if it was able to free some memory and 0
>> otherwise.
>>
>> In the latter case it's guaranteed that the in-kernel OOM killer will
>> be invoked. Otherwise the kernel also checks the bpf_memory_freed
>> field of the oom_control structure, which is expected to be set by
>> kfuncs suitable for releasing memory. It's a safety mechanism which
>> prevents a bpf program to claim forward progress without actually
>> releasing memory. The callback program is sleepable to enable using
>> iterators, e.g. cgroup iterators.
>>
>> The callback receives struct oom_control as an argument, so it can
>> easily filter out OOM's it doesn't want to handle, e.g. global vs
>> memcg OOM's.
>>
>> The callback is executed just before the kernel victim task selection
>> algorithm, so all heuristics and sysctls like panic on oom,
>> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
>> are respected.
>>
>> The struct ops also has the name field, which allows to define a
>> custom name for the implemented policy. It's printed in the OOM report
>> in the oom_policy=<policy> format. "default" is printed if bpf is not
>> used or policy name is not specified.
>>
>> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>>                oom_policy=bpf_test_policy
>> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
>> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
>> [  112.698167] Call Trace:
>> [  112.698177]  <TASK>
>> [  112.698182]  dump_stack_lvl+0x4d/0x70
>> [  112.698192]  dump_header+0x59/0x1c6
>> [  112.698199]  oom_kill_process.cold+0x8/0xef
>> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
>> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
>> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
>> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
>> [  112.698250]  out_of_memory+0xab/0x5c0
>> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
>> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
>> [  112.698288]  charge_memcg+0x2f/0xc0
>> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
>> [  112.698299]  do_anonymous_page+0x40f/0xa50
>> [  112.698311]  __handle_mm_fault+0xbba/0x1140
>> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
>> [  112.698335]  handle_mm_fault+0xe6/0x370
>> [  112.698343]  do_user_addr_fault+0x211/0x6a0
>> [  112.698354]  exc_page_fault+0x75/0x1d0
>> [  112.698363]  asm_exc_page_fault+0x26/0x30
>> [  112.698366] RIP: 0033:0x7fa97236db00
>>
>> It's possible to load multiple bpf struct programs. In the case of
>> oom, they will be executed one by one in the same order they been
>> loaded until one of them returns 1 and bpf_memory_freed is set to 1
>> - an indication that the memory was freed. This allows to have
>> multiple bpf programs to focus on different types of OOM's - e.g.
>> one program can only handle memcg OOM's in one memory cgroup.
>> But the filtering is done in bpf - so it's fully flexible.
>
> I think a natural question here is ordering. Is this ability to have
> multiple OOM programs critical right now?

Good question. Initially I supported only a single bpf policy.
But then I realized that people would likely want different
policies handling different parts of the cgroup tree,
e.g. a global policy plus several policies handling OOMs only
in some memory cgroups.
So having just a single policy is likely a no-go.

> How is it decided who gets to run before the other? Is it based on
> order of attachment (which can be non-deterministic)?

Yeah, now it's the order of attachment.

> There was a lot of discussion on something similar for tc progs, and
> we went with specific flags that capture partial ordering constraints
> (instead of priorities that may collide).
> https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
> It would be nice if we can find a way of making this consistent.

I'll take a look, thanks!

I hope that my naive approach might be good enough for a start
and that we can implement something more sophisticated later, but maybe
I'm wrong here.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc
  2025-08-20 20:30   ` Andrii Nakryiko
@ 2025-08-21  0:36     ` Roman Gushchin
  2025-08-22 19:13       ` Andrii Nakryiko
  2025-08-22 19:57       ` Martin KaFai Lau
  0 siblings, 2 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-21  0:36 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Aug 18, 2025 at 10:06 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> Implement a new bpf_psi_create_trigger() bpf kfunc, which allows
>> to create new psi triggers and attach them to cgroups or be
>> system-wide.
>>
>> Created triggers will exist until the struct ops is loaded and
>> if they are attached to a cgroup until the cgroup exists.
>>
>> Due to a limitation of 5 arguments, the resource type and the "full"
>> bit are squeezed into a single u32.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  kernel/sched/bpf_psi.c | 84 ++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 84 insertions(+)
>>
>> diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
>> index 2ea9d7276b21..94b684221708 100644
>> --- a/kernel/sched/bpf_psi.c
>> +++ b/kernel/sched/bpf_psi.c
>> @@ -156,6 +156,83 @@ static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
>>         .is_valid_access = bpf_psi_ops_is_valid_access,
>>  };
>>
>> +__bpf_kfunc_start_defs();
>> +
>> +/**
>> + * bpf_psi_create_trigger - Create a PSI trigger
>> + * @bpf_psi: bpf_psi struct to attach the trigger to
>> + * @cgroup_id: cgroup Id to attach the trigger; 0 for system-wide scope
>> + * @resource: resource to monitor (PSI_MEM, PSI_IO, etc) and the full bit.
>> + * @threshold_us: threshold in us
>> + * @window_us: window in us
>> + *
>> + * Creates a PSI trigger and attached is to bpf_psi. The trigger will be
>> + * active unless bpf struct ops is unloaded or the corresponding cgroup
>> + * is deleted.
>> + *
>> + * Resource's most significant bit encodes whether "some" or "full"
>> + * PSI state should be tracked.
>> + *
>> + * Returns 0 on success and the error code on failure.
>> + */
>> +__bpf_kfunc int bpf_psi_create_trigger(struct bpf_psi *bpf_psi,
>> +                                      u64 cgroup_id, u32 resource,
>> +                                      u32 threshold_us, u32 window_us)
>> +{
>> +       enum psi_res res = resource & ~BPF_PSI_FULL;
>> +       bool full = resource & BPF_PSI_FULL;
>> +       struct psi_trigger_params params;
>> +       struct cgroup *cgroup __maybe_unused = NULL;
>> +       struct psi_group *group;
>> +       struct psi_trigger *t;
>> +       int ret = 0;
>> +
>> +       if (res >= NR_PSI_RESOURCES)
>> +               return -EINVAL;
>> +
>> +#ifdef CONFIG_CGROUPS
>> +       if (cgroup_id) {
>> +               cgroup = cgroup_get_from_id(cgroup_id);
>> +               if (IS_ERR_OR_NULL(cgroup))
>> +                       return PTR_ERR(cgroup);
>> +
>> +               group = cgroup_psi(cgroup);
>> +       } else
>> +#endif
>> +               group = &psi_system;
>
> just a drive-by comment while skimming through the patch set: can't
> you use IS_ENABLED(CONFIG_CGROUPS) and have a proper if/else with
> proper {} ?

Fixed.
It required defining cgroup_get_from_id() and cgroup_psi()
for !CONFIG_CGROUPS, but I agree, it's much better.
Thanks
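
A rough sketch of what it ends up looking like (assuming the
!CONFIG_CGROUPS stubs mentioned above):

	struct cgroup *cgroup = NULL;
	struct psi_group *group = &psi_system;

	if (IS_ENABLED(CONFIG_CGROUPS) && cgroup_id) {
		cgroup = cgroup_get_from_id(cgroup_id);
		if (IS_ERR_OR_NULL(cgroup))
			return PTR_ERR(cgroup);

		group = cgroup_psi(cgroup);
	}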

>
>> +
>> +       params.type = PSI_BPF;
>> +       params.bpf_psi = bpf_psi;
>> +       params.privileged = capable(CAP_SYS_RESOURCE);
>> +       params.res = res;
>> +       params.full = full;
>> +       params.threshold_us = threshold_us;
>> +       params.window_us = window_us;
>> +
>> +       t = psi_trigger_create(group, &params);
>> +       if (IS_ERR(t))
>> +               ret = PTR_ERR(t);
>> +       else
>> +               t->cgroup_id = cgroup_id;
>> +
>> +#ifdef CONFIG_CGROUPS
>> +       if (cgroup)
>> +               cgroup_put(cgroup);
>> +#endif
>> +
>> +       return ret;
>> +}
>> +__bpf_kfunc_end_defs();
>> +
>> +BTF_KFUNCS_START(bpf_psi_kfuncs)
>> +BTF_ID_FLAGS(func, bpf_psi_create_trigger, KF_TRUSTED_ARGS)
>> +BTF_KFUNCS_END(bpf_psi_kfuncs)
>> +
>> +static const struct btf_kfunc_id_set bpf_psi_kfunc_set = {
>> +       .owner          = THIS_MODULE,
>> +       .set            = &bpf_psi_kfuncs,
>> +};
>> +
>>  static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
>>  {
>>         struct bpf_psi_ops *ops = kdata;
>> @@ -238,6 +315,13 @@ static int __init bpf_psi_struct_ops_init(void)
>>         if (!bpf_psi_wq)
>>                 return -ENOMEM;
>>
>> +       err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
>> +                                       &bpf_psi_kfunc_set);
>
> would this make kfunc callable from any struct_ops, not just this psi
> one?

It will. Idk how big of a problem it is, given that the caller needs
a trusted reference to bpf_psi. Also, is there a simple way to constrain
it? Wdyt?


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-21  0:24     ` Roman Gushchin
@ 2025-08-21  0:36       ` Kumar Kartikeya Dwivedi
  2025-08-21  2:22         ` Roman Gushchin
  2025-08-22 19:27       ` Martin KaFai Lau
  1 sibling, 1 reply; 67+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2025-08-21  0:36 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On Thu, 21 Aug 2025 at 02:25, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> > On Mon, 18 Aug 2025 at 19:01, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >>
> >> Introduce a bpf struct ops for implementing custom OOM handling policies.
> >>
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which expected to return 1 if it was able to free some memory and 0
> >> otherwise.
> >>
> >> In the latter case it's guaranteed that the in-kernel OOM killer will
> >> be invoked. Otherwise the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory. It's a safety mechanism which
> >> prevents a bpf program to claim forward progress without actually
> >> releasing memory. The callback program is sleepable to enable using
> >> iterators, e.g. cgroup iterators.
> >>
> >> The callback receives struct oom_control as an argument, so it can
> >> easily filter out OOM's it doesn't want to handle, e.g. global vs
> >> memcg OOM's.
> >>
> >> The callback is executed just before the kernel victim task selection
> >> algorithm, so all heuristics and sysctls like panic on oom,
> >> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
> >> are respected.
> >>
> >> The struct ops also has the name field, which allows to define a
> >> custom name for the implemented policy. It's printed in the OOM report
> >> in the oom_policy=<policy> format. "default" is printed if bpf is not
> >> used or policy name is not specified.
> >>
> >> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> >>                oom_policy=bpf_test_policy
> >> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> >> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> >> [  112.698167] Call Trace:
> >> [  112.698177]  <TASK>
> >> [  112.698182]  dump_stack_lvl+0x4d/0x70
> >> [  112.698192]  dump_header+0x59/0x1c6
> >> [  112.698199]  oom_kill_process.cold+0x8/0xef
> >> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
> >> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> >> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> >> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
> >> [  112.698250]  out_of_memory+0xab/0x5c0
> >> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
> >> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
> >> [  112.698288]  charge_memcg+0x2f/0xc0
> >> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
> >> [  112.698299]  do_anonymous_page+0x40f/0xa50
> >> [  112.698311]  __handle_mm_fault+0xbba/0x1140
> >> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> [  112.698335]  handle_mm_fault+0xe6/0x370
> >> [  112.698343]  do_user_addr_fault+0x211/0x6a0
> >> [  112.698354]  exc_page_fault+0x75/0x1d0
> >> [  112.698363]  asm_exc_page_fault+0x26/0x30
> >> [  112.698366] RIP: 0033:0x7fa97236db00
> >>
> >> It's possible to load multiple bpf struct programs. In the case of
> >> oom, they will be executed one by one in the same order they been
> >> loaded until one of them returns 1 and bpf_memory_freed is set to 1
> >> - an indication that the memory was freed. This allows to have
> >> multiple bpf programs to focus on different types of OOM's - e.g.
> >> one program can only handle memcg OOM's in one memory cgroup.
> >> But the filtering is done in bpf - so it's fully flexible.
> >
> > I think a natural question here is ordering. Is this ability to have
> > multiple OOM programs critical right now?
>
> Good question. Initially I had only supported a single bpf policy.
> But then I realized that likely people would want to have different
> policies handling different parts of the cgroup tree.
> E.g. a global policy and several policies handling OOMs only
> in some memory cgroups.
> So having just a single policy is likely a no go.

If the ordering is more to facilitate scoping, would it then be better
to support attaching the policy to specific memcg/cgroup?
There is then one global policy if need be (by attaching to root), but
descendants can have their own, which takes precedence; if it doesn't
act, we walk up the hierarchy and find the next handler in the parent
cgroup, and so on all the way to the root until one of them returns 1.
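
With the series as posted, that scoping would instead be expressed inside the
program itself. A minimal BPF-side sketch (hedged: the struct_ops member, the
sleepable section and the policy name follow the patch description, assuming
the usual vmlinux.h/libbpf includes; the memcg-only check and the elided
victim-selection step are purely illustrative):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	char LICENSE[] SEC("license") = "GPL";

	SEC("struct_ops.s/handle_out_of_memory")
	int BPF_PROG(test_out_of_memory, struct oom_control *oc)
	{
		/* Scope the policy in bpf: handle only memcg OOMs
		 * (oc->memcg is NULL for global OOMs).
		 */
		if (!oc->memcg)
			return 0;

		/* Walk the cgroup tree (e.g. with a cgroup iterator), pick a
		 * victim and release memory via a kfunc such as
		 * bpf_oom_kill_process(), which also sets oc->bpf_memory_freed.
		 */
		return 1;
	}

	SEC(".struct_ops.link")
	struct bpf_oom_ops test_policy = {
		.handle_out_of_memory = (void *)test_out_of_memory,
		.name = "bpf_test_policy",
	};

Returning 0 (or returning 1 without any kfunc having set oc->bpf_memory_freed)
falls back to the in-kernel OOM killer, as described in the commit message.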

>
> [...]


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-21  0:36       ` Kumar Kartikeya Dwivedi
@ 2025-08-21  2:22         ` Roman Gushchin
  2025-08-21 15:54           ` Suren Baghdasaryan
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-21  2:22 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:

> On Thu, 21 Aug 2025 at 02:25, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>>
>> > On Mon, 18 Aug 2025 at 19:01, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>> >>
>> >> Introduce a bpf struct ops for implementing custom OOM handling policies.
>> >>
>> >> The struct ops provides the bpf_handle_out_of_memory() callback,
>> >> which is expected to return 1 if it was able to free some memory and 0
>> >> otherwise.
>> >>
>> >> In the latter case it's guaranteed that the in-kernel OOM killer will
>> >> be invoked. Otherwise the kernel also checks the bpf_memory_freed
>> >> field of the oom_control structure, which is expected to be set by
>> >> kfuncs suitable for releasing memory. It's a safety mechanism which
>> >> prevents a bpf program from claiming forward progress without actually
>> >> releasing memory. The callback program is sleepable to enable using
>> >> iterators, e.g. cgroup iterators.
>> >>
>> >> The callback receives struct oom_control as an argument, so it can
>> >> easily filter out OOM's it doesn't want to handle, e.g. global vs
>> >> memcg OOM's.
>> >>
>> >> The callback is executed just before the kernel victim task selection
>> >> algorithm, so all heuristics and sysctls like panic on oom,
>> >> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
>> >> are respected.
>> >>
>> >> The struct ops also has the name field, which allows defining a
>> >> custom name for the implemented policy. It's printed in the OOM report
>> >> in the oom_policy=<policy> format. "default" is printed if bpf is not
>> >> used or policy name is not specified.
>> >>
>> >> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>> >>                oom_policy=bpf_test_policy
>> >> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
>> >> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
>> >> [  112.698167] Call Trace:
>> >> [  112.698177]  <TASK>
>> >> [  112.698182]  dump_stack_lvl+0x4d/0x70
>> >> [  112.698192]  dump_header+0x59/0x1c6
>> >> [  112.698199]  oom_kill_process.cold+0x8/0xef
>> >> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
>> >> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
>> >> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
>> >> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
>> >> [  112.698250]  out_of_memory+0xab/0x5c0
>> >> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
>> >> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
>> >> [  112.698288]  charge_memcg+0x2f/0xc0
>> >> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
>> >> [  112.698299]  do_anonymous_page+0x40f/0xa50
>> >> [  112.698311]  __handle_mm_fault+0xbba/0x1140
>> >> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
>> >> [  112.698335]  handle_mm_fault+0xe6/0x370
>> >> [  112.698343]  do_user_addr_fault+0x211/0x6a0
>> >> [  112.698354]  exc_page_fault+0x75/0x1d0
>> >> [  112.698363]  asm_exc_page_fault+0x26/0x30
>> >> [  112.698366] RIP: 0033:0x7fa97236db00
>> >>
>> >> It's possible to load multiple bpf struct ops programs. In the case of
>> >> oom, they will be executed one by one in the same order they were
>> >> loaded until one of them returns 1 and bpf_memory_freed is set to 1
>> >> - an indication that the memory was freed. This allows multiple bpf
>> >> programs to focus on different types of OOM's - e.g.
>> >> one program can only handle memcg OOM's in one memory cgroup.
>> >> But the filtering is done in bpf - so it's fully flexible.
>> >
>> > I think a natural question here is ordering. Is this ability to have
>> > multiple OOM programs critical right now?
>>
>> Good question. Initially I had only supported a single bpf policy.
>> But then I realized that likely people would want to have different
>> policies handling different parts of the cgroup tree.
>> E.g. a global policy and several policies handling OOMs only
>> in some memory cgroups.
>> So having just a single policy is likely a no go.
>
> If the ordering is more to facilitate scoping, would it then be better
> to support attaching the policy to specific memcg/cgroup?

Well, it has some advantages and disadvantages. First, it will require
way more infrastructure on the memcg side. Second, the interface is not
super clear: we don't want to have a struct ops per cgroup, I guess.
And in many cases a single policy for all memcgs is just fine, so asking
the user to attach it to all memcgs is just adding toil and creating
all kinds of races.
So I see your point, but I'm not yet convinced, to be honest.

Thanks!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-21  2:22         ` Roman Gushchin
@ 2025-08-21 15:54           ` Suren Baghdasaryan
  0 siblings, 0 replies; 67+ messages in thread
From: Suren Baghdasaryan @ 2025-08-21 15:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kumar Kartikeya Dwivedi, linux-mm, bpf, Johannes Weiner,
	Michal Hocko, David Rientjes, Matt Bobrowski, Song Liu,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Wed, Aug 20, 2025 at 7:22 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
>
> > On Thu, 21 Aug 2025 at 02:25, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >>
> >> Kumar Kartikeya Dwivedi <memxor@gmail.com> writes:
> >>
> >> > On Mon, 18 Aug 2025 at 19:01, Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >> >>
> >> >> Introduce a bpf struct ops for implementing custom OOM handling policies.
> >> >>
> >> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> >> which is expected to return 1 if it was able to free some memory and 0
> >> >> otherwise.
> >> >>
> >> >> In the latter case it's guaranteed that the in-kernel OOM killer will
> >> >> be invoked. Otherwise the kernel also checks the bpf_memory_freed
> >> >> field of the oom_control structure, which is expected to be set by
> >> >> kfuncs suitable for releasing memory. It's a safety mechanism which
> >> >> prevents a bpf program from claiming forward progress without actually
> >> >> releasing memory. The callback program is sleepable to enable using
> >> >> iterators, e.g. cgroup iterators.
> >> >>
> >> >> The callback receives struct oom_control as an argument, so it can
> >> >> easily filter out OOM's it doesn't want to handle, e.g. global vs
> >> >> memcg OOM's.
> >> >>
> >> >> The callback is executed just before the kernel victim task selection
> >> >> algorithm, so all heuristics and sysctls like panic on oom,
> >> >> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
> >> >> are respected.
> >> >>
> >> >> The struct ops also has the name field, which allows defining a
> >> >> custom name for the implemented policy. It's printed in the OOM report
> >> >> in the oom_policy=<policy> format. "default" is printed if bpf is not
> >> >> used or policy name is not specified.
> >> >>
> >> >> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> >> >>                oom_policy=bpf_test_policy
> >> >> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> >> >> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> >> >> [  112.698167] Call Trace:
> >> >> [  112.698177]  <TASK>
> >> >> [  112.698182]  dump_stack_lvl+0x4d/0x70
> >> >> [  112.698192]  dump_header+0x59/0x1c6
> >> >> [  112.698199]  oom_kill_process.cold+0x8/0xef
> >> >> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
> >> >> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> >> >> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> >> >> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> >> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
> >> >> [  112.698250]  out_of_memory+0xab/0x5c0
> >> >> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
> >> >> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
> >> >> [  112.698288]  charge_memcg+0x2f/0xc0
> >> >> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
> >> >> [  112.698299]  do_anonymous_page+0x40f/0xa50
> >> >> [  112.698311]  __handle_mm_fault+0xbba/0x1140
> >> >> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
> >> >> [  112.698335]  handle_mm_fault+0xe6/0x370
> >> >> [  112.698343]  do_user_addr_fault+0x211/0x6a0
> >> >> [  112.698354]  exc_page_fault+0x75/0x1d0
> >> >> [  112.698363]  asm_exc_page_fault+0x26/0x30
> >> >> [  112.698366] RIP: 0033:0x7fa97236db00
> >> >>
> >> >> It's possible to load multiple bpf struct ops programs. In the case of
> >> >> oom, they will be executed one by one in the same order they were
> >> >> loaded until one of them returns 1 and bpf_memory_freed is set to 1
> >> >> - an indication that the memory was freed. This allows multiple bpf
> >> >> programs to focus on different types of OOM's - e.g.
> >> >> one program can only handle memcg OOM's in one memory cgroup.
> >> >> But the filtering is done in bpf - so it's fully flexible.
> >> >
> >> > I think a natural question here is ordering. Is this ability to have
> >> > multiple OOM programs critical right now?
> >>
> >> Good question. Initially I had only supported a single bpf policy.
> >> But then I realized that likely people would want to have different
> >> policies handling different parts of the cgroup tree.
> >> E.g. a global policy and several policies handling OOMs only
> >> in some memory cgroups.
> >> So having just a single policy is likely a no go.
> >
> > If the ordering is more to facilitate scoping, would it then be better
> > to support attaching the policy to specific memcg/cgroup?
>
> Well, it has some advantages and disadvantages. First, it will require
> way more infrastructure on the memcg side. Second, the interface is not
> super clear: we don't want to have a struct ops per cgroup, I guess.
> And in many cases a single policy for all memcgs is just fine, so asking
> the user to attach it to all memcgs is just adding toil and creating
> all kinds of races.
> So I see your point, but I'm not yet convinced, to be honest.

I would suggest keeping it simple until we know there is a need to
prioritize between multiple oom-killers.

>
> Thanks!
>


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc
  2025-08-21  0:36     ` Roman Gushchin
@ 2025-08-22 19:13       ` Andrii Nakryiko
  2025-08-22 19:57       ` Martin KaFai Lau
  1 sibling, 0 replies; 67+ messages in thread
From: Andrii Nakryiko @ 2025-08-22 19:13 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel

On Wed, Aug 20, 2025 at 5:36 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Aug 18, 2025 at 10:06 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> >>
> >> Implement a new bpf_psi_create_trigger() bpf kfunc, which allows
> >> to create new psi triggers and attach them to cgroups or make them
> >> system-wide.
> >>
> >> Created triggers will exist as long as the struct ops is loaded and,
> >> if they are attached to a cgroup, as long as the cgroup exists.
> >>
> >> Due to a limitation of 5 arguments, the resource type and the "full"
> >> bit are squeezed into a single u32.
> >>
> >> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> >> ---
> >>  kernel/sched/bpf_psi.c | 84 ++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 84 insertions(+)
> >>
> >> diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
> >> index 2ea9d7276b21..94b684221708 100644
> >> --- a/kernel/sched/bpf_psi.c
> >> +++ b/kernel/sched/bpf_psi.c
> >> @@ -156,6 +156,83 @@ static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
> >>         .is_valid_access = bpf_psi_ops_is_valid_access,
> >>  };
> >>
> >> +__bpf_kfunc_start_defs();
> >> +
> >> +/**
> >> + * bpf_psi_create_trigger - Create a PSI trigger
> >> + * @bpf_psi: bpf_psi struct to attach the trigger to
> >> + * @cgroup_id: cgroup Id to attach the trigger; 0 for system-wide scope
> >> + * @resource: resource to monitor (PSI_MEM, PSI_IO, etc) and the full bit.
> >> + * @threshold_us: threshold in us
> >> + * @window_us: window in us
> >> + *
> >> + * Creates a PSI trigger and attaches it to bpf_psi. The trigger will remain
> >> + * active until the bpf struct ops is unloaded or the corresponding cgroup
> >> + * is deleted.
> >> + *
> >> + * Resource's most significant bit encodes whether "some" or "full"
> >> + * PSI state should be tracked.
> >> + *
> >> + * Returns 0 on success and the error code on failure.
> >> + */
> >> +__bpf_kfunc int bpf_psi_create_trigger(struct bpf_psi *bpf_psi,
> >> +                                      u64 cgroup_id, u32 resource,
> >> +                                      u32 threshold_us, u32 window_us)
> >> +{
> >> +       enum psi_res res = resource & ~BPF_PSI_FULL;
> >> +       bool full = resource & BPF_PSI_FULL;
> >> +       struct psi_trigger_params params;
> >> +       struct cgroup *cgroup __maybe_unused = NULL;
> >> +       struct psi_group *group;
> >> +       struct psi_trigger *t;
> >> +       int ret = 0;
> >> +
> >> +       if (res >= NR_PSI_RESOURCES)
> >> +               return -EINVAL;
> >> +
> >> +#ifdef CONFIG_CGROUPS
> >> +       if (cgroup_id) {
> >> +               cgroup = cgroup_get_from_id(cgroup_id);
> >> +               if (IS_ERR_OR_NULL(cgroup))
> >> +                       return PTR_ERR(cgroup);
> >> +
> >> +               group = cgroup_psi(cgroup);
> >> +       } else
> >> +#endif
> >> +               group = &psi_system;
> >
> > just a drive-by comment while skimming through the patch set: can't
> > you use IS_ENABLED(CONFIG_CGROUPS) and have a proper if/else with
> > proper {} ?
>
> Fixed.
> It required defining cgroup_get_from_id() and cgroup_psi()
> for !CONFIG_CGROUPS, but I agree, it's much better.
> Thanks
>
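
For illustration, the rework described above could end up looking roughly like
this inside bpf_psi_create_trigger() (a sketch only, assuming !CONFIG_CGROUPS
stubs for cgroup_get_from_id() and cgroup_psi() are in place; not the actual
v2 code):

	struct cgroup *cgroup = NULL;
	struct psi_group *group;

	if (IS_ENABLED(CONFIG_CGROUPS) && cgroup_id) {
		cgroup = cgroup_get_from_id(cgroup_id);
		if (IS_ERR_OR_NULL(cgroup))
			return PTR_ERR(cgroup);

		group = cgroup_psi(cgroup);
	} else {
		group = &psi_system;
	}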
> >
> >> +
> >> +       params.type = PSI_BPF;
> >> +       params.bpf_psi = bpf_psi;
> >> +       params.privileged = capable(CAP_SYS_RESOURCE);
> >> +       params.res = res;
> >> +       params.full = full;
> >> +       params.threshold_us = threshold_us;
> >> +       params.window_us = window_us;
> >> +
> >> +       t = psi_trigger_create(group, &params);
> >> +       if (IS_ERR(t))
> >> +               ret = PTR_ERR(t);
> >> +       else
> >> +               t->cgroup_id = cgroup_id;
> >> +
> >> +#ifdef CONFIG_CGROUPS
> >> +       if (cgroup)
> >> +               cgroup_put(cgroup);
> >> +#endif
> >> +
> >> +       return ret;
> >> +}
> >> +__bpf_kfunc_end_defs();
> >> +
> >> +BTF_KFUNCS_START(bpf_psi_kfuncs)
> >> +BTF_ID_FLAGS(func, bpf_psi_create_trigger, KF_TRUSTED_ARGS)
> >> +BTF_KFUNCS_END(bpf_psi_kfuncs)
> >> +
> >> +static const struct btf_kfunc_id_set bpf_psi_kfunc_set = {
> >> +       .owner          = THIS_MODULE,
> >> +       .set            = &bpf_psi_kfuncs,
> >> +};
> >> +
> >>  static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
> >>  {
> >>         struct bpf_psi_ops *ops = kdata;
> >> @@ -238,6 +315,13 @@ static int __init bpf_psi_struct_ops_init(void)
> >>         if (!bpf_psi_wq)
> >>                 return -ENOMEM;
> >>
> >> +       err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> >> +                                       &bpf_psi_kfunc_set);
> >
> > would this make kfunc callable from any struct_ops, not just this psi
> > one?
>
> It will. Idk how big of a problem it is, given that the caller needs
> a trusted reference to bpf_psi.

Yes, I agree, probably not a big deal.

> Also, is there a simple way to constrain it? Wdyt?

We've talked about having the ability to restrict kfuncs to specific
struct_ops types, but I don't think we've ever made much progress on
this. So no, I don't think there is a simple way.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-21  0:24     ` Roman Gushchin
  2025-08-21  0:36       ` Kumar Kartikeya Dwivedi
@ 2025-08-22 19:27       ` Martin KaFai Lau
  2025-08-25 17:00         ` Roman Gushchin
  1 sibling, 1 reply; 67+ messages in thread
From: Martin KaFai Lau @ 2025-08-22 19:27 UTC (permalink / raw)
  To: Roman Gushchin, Kumar Kartikeya Dwivedi
  Cc: linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On 8/20/25 5:24 PM, Roman Gushchin wrote:
>> How is it decided who gets to run before the other? Is it based on
>> order of attachment (which can be non-deterministic)?
> Yeah, now it's the order of attachment.
> 
>> There was a lot of discussion on something similar for tc progs, and
>> we went with specific flags that capture partial ordering constraints
>> (instead of priorities that may collide).
>> https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
>> It would be nice if we can find a way of making this consistent.

+1

The cgroup bpf prog has recently added mprog api support also. If the simple
order of attachment is not enough and specific ordering is needed, we should
make the bpf struct_ops support the same mprog api instead of asking each
subsystem to create its own.

fyi, another need for struct_ops ordering is to upgrade the 
BPF_PROG_TYPE_SOCK_OPS api to struct_ops for easier extension in the future. 
Slide 13 in https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc
  2025-08-21  0:36     ` Roman Gushchin
  2025-08-22 19:13       ` Andrii Nakryiko
@ 2025-08-22 19:57       ` Martin KaFai Lau
  2025-08-25 16:56         ` Roman Gushchin
  1 sibling, 1 reply; 67+ messages in thread
From: Martin KaFai Lau @ 2025-08-22 19:57 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrii Nakryiko, linux-mm, bpf, Suren Baghdasaryan,
	Johannes Weiner, Michal Hocko, David Rientjes, Matt Bobrowski,
	Song Liu, Kumar Kartikeya Dwivedi, Alexei Starovoitov,
	Andrew Morton, linux-kernel

On 8/20/25 5:36 PM, Roman Gushchin wrote:
> It will. Idk how big of a problem it is, given that the caller needs
> a trusted reference to bpf_psi. Also, is there a simple way to constrain
> it? Wdyt?

The bpf qdisc has the kfunc filtering. Take a look at the bpf_qdisc_kfunc_filter 
in bpf_qdisc.c.
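
For illustration, the shape of such a filter applied to this kfunc set might
be (a rough sketch; the actual checks in bpf_qdisc_kfunc_filter differ, and
is_bpf_psi_ops_prog() is a hypothetical helper standing in for a comparison
of the program's attach BTF id against bpf_psi_ops):

	static int bpf_psi_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
	{
		/* Reject the kfunc unless the calling program is attached to
		 * the bpf_psi_ops struct_ops.
		 */
		if (!is_bpf_psi_ops_prog(prog))
			return -EACCES;

		return 0;
	}

	static const struct btf_kfunc_id_set bpf_psi_kfunc_set = {
		.owner	= THIS_MODULE,
		.set	= &bpf_psi_kfuncs,
		.filter	= bpf_psi_kfunc_filter,
	};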


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc
  2025-08-22 19:57       ` Martin KaFai Lau
@ 2025-08-25 16:56         ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-25 16:56 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Andrii Nakryiko, linux-mm, bpf, Suren Baghdasaryan,
	Johannes Weiner, Michal Hocko, David Rientjes, Matt Bobrowski,
	Song Liu, Kumar Kartikeya Dwivedi, Alexei Starovoitov,
	Andrew Morton, linux-kernel

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 8/20/25 5:36 PM, Roman Gushchin wrote:
>> It will. Idk how big of a problem it is, given that the caller needs
>> a trusted reference to bpf_psi. Also, is there a simple way to constrain
>> it? Wdyt?
>
> The bpf qdisc has the kfunc filtering. Take a look at the
> bpf_qdisc_kfunc_filter in bpf_qdisc.c.

Thanks! I'll take a look.


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-22 19:27       ` Martin KaFai Lau
@ 2025-08-25 17:00         ` Roman Gushchin
  2025-08-26 18:01           ` Martin KaFai Lau
  0 siblings, 1 reply; 67+ messages in thread
From: Roman Gushchin @ 2025-08-25 17:00 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Kumar Kartikeya Dwivedi, linux-mm, bpf, Suren Baghdasaryan,
	Johannes Weiner, Michal Hocko, David Rientjes, Matt Bobrowski,
	Song Liu, Alexei Starovoitov, Andrew Morton, linux-kernel

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 8/20/25 5:24 PM, Roman Gushchin wrote:
>>> How is it decided who gets to run before the other? Is it based on
>>> order of attachment (which can be non-deterministic)?
>> Yeah, now it's the order of attachment.
>> 
>>> There was a lot of discussion on something similar for tc progs, and
>>> we went with specific flags that capture partial ordering constraints
>>> (instead of priorities that may collide).
>>> https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
>>> It would be nice if we can find a way of making this consistent.
>
> +1
>
> The cgroup bpf prog has recently added the mprog api support also. If
> the simple order of attachment is not enough and needs to have
> specific ordering, we should make the bpf struct_ops support the same
> mprog api instead of asking each subsystem creating its own.
>
> fyi, another need for struct_ops ordering is to upgrade the
> BPF_PROG_TYPE_SOCK_OPS api to struct_ops for easier extension in the
> future. Slide 13 in
> https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view

Does it mean it's better now to keep it simple in the context of oom
patches with the plan to later reuse the generic struct_ops
infrastructure?

Honestly, I believe that the simple order of attachment should be
good enough for quite a while, so I'd not over-complicate this,
unless it's not fixable later.

Thanks!


^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-20 19:34       ` Suren Baghdasaryan
  2025-08-20 19:52         ` Roman Gushchin
@ 2025-08-26 16:23         ` Amery Hung
  1 sibling, 0 replies; 67+ messages in thread
From: Amery Hung @ 2025-08-26 16:23 UTC (permalink / raw)
  To: Suren Baghdasaryan, Roman Gushchin
  Cc: linux-mm, bpf, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel



On 8/20/25 12:34 PM, Suren Baghdasaryan wrote:
> On Tue, Aug 19, 2025 at 1:06 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>> Suren Baghdasaryan <surenb@google.com> writes:
>>
>>> On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
>>> <roman.gushchin@linux.dev> wrote:
>>>> Introduce a bpf struct ops for implementing custom OOM handling policies.
>>>>
>>>> The struct ops provides the bpf_handle_out_of_memory() callback,
>>>> which is expected to return 1 if it was able to free some memory and 0
>>>> otherwise.
>>>>
>>>> In the latter case it's guaranteed that the in-kernel OOM killer will
>>>> be invoked. Otherwise the kernel also checks the bpf_memory_freed
>>>> field of the oom_control structure, which is expected to be set by
>>>> kfuncs suitable for releasing memory. It's a safety mechanism which
>>>> prevents a bpf program from claiming forward progress without actually
>>>> releasing memory. The callback program is sleepable to enable using
>>>> iterators, e.g. cgroup iterators.
>>>>
>>>> The callback receives struct oom_control as an argument, so it can
>>>> easily filter out OOM's it doesn't want to handle, e.g. global vs
>>>> memcg OOM's.
>>>>
>>>> The callback is executed just before the kernel victim task selection
>>>> algorithm, so all heuristics and sysctls like panic on oom,
>>>> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
>>>> are respected.
>>>>
>>>> The struct ops also has the name field, which allows defining a
>>>> custom name for the implemented policy. It's printed in the OOM report
>>>> in the oom_policy=<policy> format. "default" is printed if bpf is not
>>>> used or policy name is not specified.
>>>>
>>>> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>>>>                 oom_policy=bpf_test_policy
>>>> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
>>>> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
>>>> [  112.698167] Call Trace:
>>>> [  112.698177]  <TASK>
>>>> [  112.698182]  dump_stack_lvl+0x4d/0x70
>>>> [  112.698192]  dump_header+0x59/0x1c6
>>>> [  112.698199]  oom_kill_process.cold+0x8/0xef
>>>> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
>>>> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
>>>> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
>>>> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
>>>> [  112.698250]  out_of_memory+0xab/0x5c0
>>>> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
>>>> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
>>>> [  112.698288]  charge_memcg+0x2f/0xc0
>>>> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
>>>> [  112.698299]  do_anonymous_page+0x40f/0xa50
>>>> [  112.698311]  __handle_mm_fault+0xbba/0x1140
>>>> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
>>>> [  112.698335]  handle_mm_fault+0xe6/0x370
>>>> [  112.698343]  do_user_addr_fault+0x211/0x6a0
>>>> [  112.698354]  exc_page_fault+0x75/0x1d0
>>>> [  112.698363]  asm_exc_page_fault+0x26/0x30
>>>> [  112.698366] RIP: 0033:0x7fa97236db00
>>>>
>>>> It's possible to load multiple bpf struct ops programs. In the case of
>>>> oom, they will be executed one by one in the same order they were
>>>> loaded until one of them returns 1 and bpf_memory_freed is set to 1
>>>> - an indication that the memory was freed. This allows multiple bpf
>>>> programs to focus on different types of OOM's - e.g.
>>>> one program can only handle memcg OOM's in one memory cgroup.
>>>> But the filtering is done in bpf - so it's fully flexible.
>>>>
>>>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>>>> ---
>>>>   include/linux/bpf_oom.h |  49 +++++++++++++
>>>>   include/linux/oom.h     |   8 ++
>>>>   mm/Makefile             |   3 +
>>>>   mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
>>>>   mm/oom_kill.c           |  22 +++++-
>>>>   5 files changed, 237 insertions(+), 2 deletions(-)
>>>>   create mode 100644 include/linux/bpf_oom.h
>>>>   create mode 100644 mm/bpf_oom.c
>>>>
>>>> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
>>>> new file mode 100644
>>>> index 000000000000..29cb5ea41d97
>>>> --- /dev/null
>>>> +++ b/include/linux/bpf_oom.h
>>>> @@ -0,0 +1,49 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0+ */
>>>> +
>>>> +#ifndef __BPF_OOM_H
>>>> +#define __BPF_OOM_H
>>>> +
>>>> +struct bpf_oom;
>>>> +struct oom_control;
>>>> +
>>>> +#define BPF_OOM_NAME_MAX_LEN 64
>>>> +
>>>> +struct bpf_oom_ops {
>>>> +       /**
>>>> +        * @handle_out_of_memory: Out of memory bpf handler, called before
>>>> +        * the in-kernel OOM killer.
>>>> +        * @oc: OOM control structure
>>>> +        *
>>>> +        * Should return 1 if some memory was freed up, otherwise
>>>> +        * the in-kernel OOM killer is invoked.
>>>> +        */
>>>> +       int (*handle_out_of_memory)(struct oom_control *oc);
>>>> +
>>>> +       /**
>>>> +        * @name: BPF OOM policy name
>>>> +        */
>>>> +       char name[BPF_OOM_NAME_MAX_LEN];
>>> Why should the name be a part of ops structure? IMO it's not an
>>> attribute of the operations but rather of the oom handler which is
>>> represented by bpf_oom here.
>> The ops structure describes a user-defined oom policy. Currently
>> it's just one handler and the policy name. Later additional handlers
>> can be added, e.g. a handler to control the dmesg output.
>>
>> bpf_oom is an implementation detail: it's basically an extension
>> to struct bpf_oom_ops which contains "private" fields required
>> for the internal machinery.
> Ok. I hope we can come up with some more descriptive naming but I
> can't think of something good ATM.
>
>>>> +
>>>> +       /* Private */
>>>> +       struct bpf_oom *bpf_oom;
>>>> +};
>>>> +
>>>> +#ifdef CONFIG_BPF_SYSCALL
>>>> +/**
>>>> + * @bpf_handle_oom: handle out of memory using bpf programs
>>>> + * @oc: OOM control structure
>>>> + *
>>>> + * Returns true if a bpf oom program was executed, returned 1
>>>> + * and some memory was actually freed.
>>> The above comment is unclear, please clarify.
>> Fixed, thanks.
>>
>> /**
>>   * @bpf_handle_oom: handle out of memory condition using bpf
>>   * @oc: OOM control structure
>>   *
>>   * Returns true if some memory was freed.
>>   */
>> bool bpf_handle_oom(struct oom_control *oc);
>>
>>
>>>> + */
>>>> +bool bpf_handle_oom(struct oom_control *oc);
>>>> +
>>>> +#else /* CONFIG_BPF_SYSCALL */
>>>> +static inline bool bpf_handle_oom(struct oom_control *oc)
>>>> +{
>>>> +       return false;
>>>> +}
>>>> +
>>>> +#endif /* CONFIG_BPF_SYSCALL */
>>>> +
>>>> +#endif /* __BPF_OOM_H */
>>>> diff --git a/include/linux/oom.h b/include/linux/oom.h
>>>> index 1e0fc6931ce9..ef453309b7ea 100644
>>>> --- a/include/linux/oom.h
>>>> +++ b/include/linux/oom.h
>>>> @@ -51,6 +51,14 @@ struct oom_control {
>>>>
>>>>          /* Used to print the constraint info. */
>>>>          enum oom_constraint constraint;
>>>> +
>>>> +#ifdef CONFIG_BPF_SYSCALL
>>>> +       /* Used by the bpf oom implementation to mark the forward progress */
>>>> +       bool bpf_memory_freed;
>>>> +
>>>> +       /* Policy name */
>>>> +       const char *bpf_policy_name;
>>>> +#endif
>>>>   };
>>>>
>>>>   extern struct mutex oom_lock;
>>>> diff --git a/mm/Makefile b/mm/Makefile
>>>> index 1a7a11d4933d..a714aba03759 100644
>>>> --- a/mm/Makefile
>>>> +++ b/mm/Makefile
>>>> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>>>>   ifdef CONFIG_SWAP
>>>>   obj-$(CONFIG_MEMCG) += swap_cgroup.o
>>>>   endif
>>>> +ifdef CONFIG_BPF_SYSCALL
>>>> +obj-y += bpf_oom.o
>>>> +endif
>>>>   obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>>>>   obj-$(CONFIG_GUP_TEST) += gup_test.o
>>>>   obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
>>>> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
>>>> new file mode 100644
>>>> index 000000000000..47633046819c
>>>> --- /dev/null
>>>> +++ b/mm/bpf_oom.c
>>>> @@ -0,0 +1,157 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-or-later
>>>> +/*
>>>> + * BPF-driven OOM killer customization
>>>> + *
>>>> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
>>>> + */
>>>> +
>>>> +#include <linux/bpf.h>
>>>> +#include <linux/oom.h>
>>>> +#include <linux/bpf_oom.h>
>>>> +#include <linux/srcu.h>
>>>> +
>>>> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
>>>> +static DEFINE_SPINLOCK(bpf_oom_lock);
>>>> +static LIST_HEAD(bpf_oom_handlers);
>>>> +
>>>> +struct bpf_oom {
>>> Perhaps bpf_oom_handler ? Then bpf_oom_ops->bpf_oom could be called
>>> bpf_oom_ops->handler.
>> I don't think it's a handler, it's more like a private part
>> of bpf_oom_ops. Maybe bpf_oom_impl? Idk
> Yeah, we need to come up with some nomenclature and name these structs
> accordingly. In my mind ops means a structure that contains only
> operations, so current naming does not sit well but maybe that's just
> me...

Some existing xxx_ops also have non-operation members. E.g., 
tcp_congestion_ops, Qdisc_ops, vfio_device_ops, or tpm_class_ops. Maybe 
bpf_oom_ops is okay if that doesn't cause too much confusion?

>
>>>
>>>> +       struct bpf_oom_ops *ops;
>>>> +       struct list_head node;
>>>> +       struct srcu_struct srcu;
>>>> +};
>>>> +
>>>> +bool bpf_handle_oom(struct oom_control *oc)
>>>> +{
>>>> +       struct bpf_oom_ops *ops;
>>>> +       struct bpf_oom *bpf_oom;
>>>> +       int list_idx, idx, ret = 0;
>>>> +
>>>> +       oc->bpf_memory_freed = false;
>>>> +
>>>> +       list_idx = srcu_read_lock(&bpf_oom_srcu);
>>>> +       list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
>>>> +               ops = READ_ONCE(bpf_oom->ops);
>>>> +               if (!ops || !ops->handle_out_of_memory)
>>>> +                       continue;
>>>> +               idx = srcu_read_lock(&bpf_oom->srcu);
>>>> +               oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
>>>> +                       "bpf_defined_policy";
>>>> +               ret = ops->handle_out_of_memory(oc);
>>>> +               oc->bpf_policy_name = NULL;
>>>> +               srcu_read_unlock(&bpf_oom->srcu, idx);
>>>> +
>>>> +               if (ret && oc->bpf_memory_freed)
>>> IIUC ret and oc->bpf_memory_freed seem to reflect the same state:
>>> handler successfully freed some memory. Could you please clarify when
>>> they differ?
>> The idea here is to provide an additional safety measure:
>> if the bpf program simply returns 1 without doing anything,
>> the system won't deadlock.
>>
>> oc->bpf_memory_freed is set by the bpf_oom_kill_process() helper
>> (and potentially some other helpers in the future, e.g.
>> bpf_oom_rm_tmpfs_file()) and can't be modified by the bpf
>> program directly.
> I see. Then maybe we use only oc->bpf_memory_freed and
> handle_out_of_memory() does not return anything?



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-18 17:01 ` [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling Roman Gushchin
  2025-08-19  4:09   ` Suren Baghdasaryan
  2025-08-20 11:28   ` Kumar Kartikeya Dwivedi
@ 2025-08-26 16:56   ` Amery Hung
  2 siblings, 0 replies; 67+ messages in thread
From: Amery Hung @ 2025-08-26 16:56 UTC (permalink / raw)
  To: Roman Gushchin, linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel, Martin KaFai Lau



On 8/18/25 10:01 AM, Roman Gushchin wrote:
> Introduce a bpf struct ops for implementing custom OOM handling policies.
>
> The struct ops provides the bpf_handle_out_of_memory() callback,
> which is expected to return 1 if it was able to free some memory and 0
> otherwise.
>
> In the latter case it's guaranteed that the in-kernel OOM killer will
> be invoked. Otherwise the kernel also checks the bpf_memory_freed
> field of the oom_control structure, which is expected to be set by
> kfuncs suitable for releasing memory. It's a safety mechanism which
> prevents a bpf program from claiming forward progress without actually
> releasing memory. The callback program is sleepable to enable using
> iterators, e.g. cgroup iterators.
>
> The callback receives struct oom_control as an argument, so it can
> easily filter out OOM's it doesn't want to handle, e.g. global vs
> memcg OOM's.
>
> The callback is executed just before the kernel victim task selection
> algorithm, so all heuristics and sysctls like panic on oom,
> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
> are respected.
>
> The struct ops also has the name field, which allows defining a
> custom name for the implemented policy. It's printed in the OOM report
> in the oom_policy=<policy> format. "default" is printed if bpf is not
> used or policy name is not specified.
>
> [  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>                 oom_policy=bpf_test_policy
> [  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> [  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> [  112.698167] Call Trace:
> [  112.698177]  <TASK>
> [  112.698182]  dump_stack_lvl+0x4d/0x70
> [  112.698192]  dump_header+0x59/0x1c6
> [  112.698199]  oom_kill_process.cold+0x8/0xef
> [  112.698206]  bpf_oom_kill_process+0x59/0xb0
> [  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> [  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> [  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  112.698240]  bpf_handle_oom+0x11a/0x1e0
> [  112.698250]  out_of_memory+0xab/0x5c0
> [  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
> [  112.698274]  try_charge_memcg+0x4b5/0x7e0
> [  112.698288]  charge_memcg+0x2f/0xc0
> [  112.698293]  __mem_cgroup_charge+0x30/0xc0
> [  112.698299]  do_anonymous_page+0x40f/0xa50
> [  112.698311]  __handle_mm_fault+0xbba/0x1140
> [  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  112.698335]  handle_mm_fault+0xe6/0x370
> [  112.698343]  do_user_addr_fault+0x211/0x6a0
> [  112.698354]  exc_page_fault+0x75/0x1d0
> [  112.698363]  asm_exc_page_fault+0x26/0x30
> [  112.698366] RIP: 0033:0x7fa97236db00
>
> It's possible to load multiple bpf struct ops programs. In the case of
> oom, they will be executed one by one in the same order they were
> loaded until one of them returns 1 and bpf_memory_freed is set to 1
> - an indication that the memory was freed. This allows multiple bpf
> programs to focus on different types of OOM's - e.g.
> one program can only handle memcg OOM's in one memory cgroup.
> But the filtering is done in bpf - so it's fully flexible.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>   include/linux/bpf_oom.h |  49 +++++++++++++
>   include/linux/oom.h     |   8 ++
>   mm/Makefile             |   3 +
>   mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
>   mm/oom_kill.c           |  22 +++++-
>   5 files changed, 237 insertions(+), 2 deletions(-)
>   create mode 100644 include/linux/bpf_oom.h
>   create mode 100644 mm/bpf_oom.c
>
> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
> new file mode 100644
> index 000000000000..29cb5ea41d97
> --- /dev/null
> +++ b/include/linux/bpf_oom.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_OOM_H
> +#define __BPF_OOM_H
> +
> +struct bpf_oom;
> +struct oom_control;
> +
> +#define BPF_OOM_NAME_MAX_LEN 64
> +
> +struct bpf_oom_ops {
> +	/**
> +	 * @handle_out_of_memory: Out of memory bpf handler, called before
> +	 * the in-kernel OOM killer.
> +	 * @oc: OOM control structure
> +	 *
> +	 * Should return 1 if some memory was freed up, otherwise
> +	 * the in-kernel OOM killer is invoked.
> +	 */
> +	int (*handle_out_of_memory)(struct oom_control *oc);

I suggest adding "struct bpf_oom *" as the first argument to all
bpf_oom_ops operations to future-proof them. It will allow a bpf_oom kfunc
or prog to refer to the struct_ops instance itself.

Since bpf_oom_ops allows multiple attachments, if a bpf_prog is shared
between two bpf_oom instances, it will be able to infer which bpf_oom_ops
is calling it from this extra argument.
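
Concretely, the suggestion would turn the callback prototype into something
like (sketch only):

	int (*handle_out_of_memory)(struct bpf_oom *bpf_oom, struct oom_control *oc);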


> +
> +	/**
> +	 * @name: BPF OOM policy name
> +	 */
> +	char name[BPF_OOM_NAME_MAX_LEN];
> +
> +	/* Private */
> +	struct bpf_oom *bpf_oom;
> +};
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +/**
> + * @bpf_handle_oom: handle out of memory using bpf programs
> + * @oc: OOM control structure
> + *
> + * Returns true if a bpf oom program was executed, returned 1
> + * and some memory was actually freed.
> + */
> +bool bpf_handle_oom(struct oom_control *oc);
> +
> +#else /* CONFIG_BPF_SYSCALL */
> +static inline bool bpf_handle_oom(struct oom_control *oc)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_BPF_SYSCALL */
> +
> +#endif /* __BPF_OOM_H */
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 1e0fc6931ce9..ef453309b7ea 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -51,6 +51,14 @@ struct oom_control {
>   
>   	/* Used to print the constraint info. */
>   	enum oom_constraint constraint;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +	/* Used by the bpf oom implementation to mark the forward progress */
> +	bool bpf_memory_freed;
> +
> +	/* Policy name */
> +	const char *bpf_policy_name;
> +#endif
>   };
>   
>   extern struct mutex oom_lock;
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..a714aba03759 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>   ifdef CONFIG_SWAP
>   obj-$(CONFIG_MEMCG) += swap_cgroup.o
>   endif
> +ifdef CONFIG_BPF_SYSCALL
> +obj-y += bpf_oom.o
> +endif
>   obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>   obj-$(CONFIG_GUP_TEST) += gup_test.o
>   obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
> new file mode 100644
> index 000000000000..47633046819c
> --- /dev/null
> +++ b/mm/bpf_oom.c
> @@ -0,0 +1,157 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * BPF-driven OOM killer customization
> + *
> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/oom.h>
> +#include <linux/bpf_oom.h>
> +#include <linux/srcu.h>
> +
> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
> +static DEFINE_SPINLOCK(bpf_oom_lock);
> +static LIST_HEAD(bpf_oom_handlers);
> +
> +struct bpf_oom {
> +	struct bpf_oom_ops *ops;
> +	struct list_head node;
> +	struct srcu_struct srcu;
> +};
> +
> +bool bpf_handle_oom(struct oom_control *oc)
> +{
> +	struct bpf_oom_ops *ops;
> +	struct bpf_oom *bpf_oom;
> +	int list_idx, idx, ret = 0;
> +
> +	oc->bpf_memory_freed = false;
> +
> +	list_idx = srcu_read_lock(&bpf_oom_srcu);
> +	list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
> +		ops = READ_ONCE(bpf_oom->ops);
> +		if (!ops || !ops->handle_out_of_memory)
> +			continue;
> +		idx = srcu_read_lock(&bpf_oom->srcu);
> +		oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
> +			"bpf_defined_policy";
> +		ret = ops->handle_out_of_memory(oc);
> +		oc->bpf_policy_name = NULL;
> +		srcu_read_unlock(&bpf_oom->srcu, idx);
> +
> +		if (ret && oc->bpf_memory_freed)
> +			break;
> +	}
> +	srcu_read_unlock(&bpf_oom_srcu, list_idx);
> +
> +	return ret && oc->bpf_memory_freed;
> +}
> +
> +static int __handle_out_of_memory(struct oom_control *oc)
> +{
> +	return 0;
> +}
> +
> +static struct bpf_oom_ops __bpf_oom_ops = {
> +	.handle_out_of_memory = __handle_out_of_memory,
> +};
> +
> +static const struct bpf_func_proto *
> +bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return tracing_prog_func_proto(func_id, prog);
> +}
> +
> +static bool bpf_oom_ops_is_valid_access(int off, int size,
> +					enum bpf_access_type type,
> +					const struct bpf_prog *prog,
> +					struct bpf_insn_access_aux *info)
> +{
> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
> +	.get_func_proto = bpf_oom_func_proto,
> +	.is_valid_access = bpf_oom_ops_is_valid_access,
> +};
> +
> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_oom_ops *ops = kdata;
> +	struct bpf_oom *bpf_oom;
> +	int ret;
> +
> +	bpf_oom = kmalloc(sizeof(*bpf_oom), GFP_KERNEL_ACCOUNT);
> +	if (!bpf_oom)
> +		return -ENOMEM;
> +
> +	ret = init_srcu_struct(&bpf_oom->srcu);
> +	if (ret) {
> +		kfree(bpf_oom);
> +		return ret;
> +	}
> +
> +	WRITE_ONCE(bpf_oom->ops, ops);
> +	ops->bpf_oom = bpf_oom;
> +
> +	spin_lock(&bpf_oom_lock);
> +	list_add_rcu(&bpf_oom->node, &bpf_oom_handlers);
> +	spin_unlock(&bpf_oom_lock);
> +
> +	return 0;
> +}
> +
> +static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_oom_ops *ops = kdata;
> +	struct bpf_oom *bpf_oom = ops->bpf_oom;
> +
> +	WRITE_ONCE(bpf_oom->ops, NULL);
> +
> +	spin_lock(&bpf_oom_lock);
> +	list_del_rcu(&bpf_oom->node);
> +	spin_unlock(&bpf_oom_lock);
> +
> +	synchronize_srcu(&bpf_oom->srcu);
> +
> +	kfree(bpf_oom);
> +}
> +
> +static int bpf_oom_ops_init_member(const struct btf_type *t,
> +				   const struct btf_member *member,
> +				   void *kdata, const void *udata)
> +{
> +	const struct bpf_oom_ops *uops = (const struct bpf_oom_ops *)udata;
> +	struct bpf_oom_ops *ops = (struct bpf_oom_ops *)kdata;
> +	u32 moff = __btf_member_bit_offset(t, member) / 8;
> +
> +	switch (moff) {
> +	case offsetof(struct bpf_oom_ops, name):
> +		strscpy_pad(ops->name, uops->name, sizeof(ops->name));
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +static int bpf_oom_ops_init(struct btf *btf)
> +{
> +	return 0;
> +}
> +
> +static struct bpf_struct_ops bpf_oom_bpf_ops = {
> +	.verifier_ops = &bpf_oom_verifier_ops,
> +	.reg = bpf_oom_ops_reg,
> +	.unreg = bpf_oom_ops_unreg,
> +	.init_member = bpf_oom_ops_init_member,
> +	.init = bpf_oom_ops_init,
> +	.name = "bpf_oom_ops",
> +	.owner = THIS_MODULE,
> +	.cfi_stubs = &__bpf_oom_ops
> +};
> +
> +static int __init bpf_oom_struct_ops_init(void)
> +{
> +	return register_bpf_struct_ops(&bpf_oom_bpf_ops, bpf_oom_ops);
> +}
> +late_initcall(bpf_oom_struct_ops_init);
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 25923cfec9c6..ad7bd65061d6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -45,6 +45,7 @@
>   #include <linux/mmu_notifier.h>
>   #include <linux/cred.h>
>   #include <linux/nmi.h>
> +#include <linux/bpf_oom.h>
>   
>   #include <asm/tlb.h>
>   #include "internal.h"
> @@ -246,6 +247,15 @@ static const char * const oom_constraint_text[] = {
>   	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
>   };
>   
> +static const char *oom_policy_name(struct oom_control *oc)
> +{
> +#ifdef CONFIG_BPF_SYSCALL
> +	if (oc->bpf_policy_name)
> +		return oc->bpf_policy_name;
> +#endif
> +	return "default";
> +}
> +
>   /*
>    * Determine the type of allocation constraint.
>    */
> @@ -458,9 +468,10 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
>   
>   static void dump_header(struct oom_control *oc)
>   {
> -	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
> +	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\noom_policy=%s\n",
>   		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
> -			current->signal->oom_score_adj);
> +		current->signal->oom_score_adj,
> +		oom_policy_name(oc));
>   	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
>   		pr_warn("COMPACTION is disabled!!!\n");
>   
> @@ -1161,6 +1172,13 @@ bool out_of_memory(struct oom_control *oc)
>   		return true;
>   	}
>   
> +	/*
> +	 * Let bpf handle the OOM first. If it was able to free up some memory,
> +	 * bail out. Otherwise fall back to the kernel OOM killer.
> +	 */
> +	if (bpf_handle_oom(oc))
> +		return true;
> +
>   	select_bad_process(oc);
>   	/* Found nothing?!?! */
>   	if (!oc->chosen) {



^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf
  2025-08-18 17:01 ` [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf Roman Gushchin
  2025-08-19  4:11   ` Suren Baghdasaryan
@ 2025-08-26 17:03   ` Amery Hung
  1 sibling, 0 replies; 67+ messages in thread
From: Amery Hung @ 2025-08-26 17:03 UTC (permalink / raw)
  To: Roman Gushchin, linux-mm, bpf
  Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Andrew Morton, linux-kernel



On 8/18/25 10:01 AM, Roman Gushchin wrote:
> This patch implements a bpf struct ops-based mechanism to create
> psi triggers, attach them to cgroups or system wide and handle
> psi events in bpf.
>
> The struct ops provides 3 callbacks:
>    - init() called once at load, handy for creating psi triggers
>    - handle_psi_event() called every time a psi trigger fires
>    - handle_cgroup_free() called if a cgroup with an attached
>      trigger is being freed
>
> A single struct ops can create a number of psi triggers, both
> cgroup-scoped and system-wide.
>
> All 3 struct ops callbacks can be sleepable. handle_psi_event()
> handlers are executed using a separate workqueue, so it won't
> affect the latency of other psi triggers.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>   include/linux/bpf_psi.h      |  71 ++++++++++
>   include/linux/psi_types.h    |  43 +++++-
>   kernel/sched/bpf_psi.c       | 253 +++++++++++++++++++++++++++++++++++
>   kernel/sched/build_utility.c |   4 +
>   kernel/sched/psi.c           |  49 +++++--
>   5 files changed, 408 insertions(+), 12 deletions(-)
>   create mode 100644 include/linux/bpf_psi.h
>   create mode 100644 kernel/sched/bpf_psi.c
>
> diff --git a/include/linux/bpf_psi.h b/include/linux/bpf_psi.h
> new file mode 100644
> index 000000000000..826ab89ac11c
> --- /dev/null
> +++ b/include/linux/bpf_psi.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_PSI_H
> +#define __BPF_PSI_H
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/srcu.h>
> +#include <linux/psi_types.h>
> +
> +struct cgroup;
> +struct bpf_psi;
> +struct psi_trigger;
> +struct psi_trigger_params;
> +
> +#define BPF_PSI_FULL 0x80000000
> +
> +struct bpf_psi_ops {
> +	/**
> +	 * @init: Initialization callback, suited for creating psi triggers.
> +	 * @bpf_psi: bpf_psi pointer, can be passed to bpf_psi_create_trigger().
> +	 *
> +	 * A non-0 return value means the initialization has failed.
> +	 */
> +	int (*init)(struct bpf_psi *bpf_psi);
> +
> +	/**
> +	 * @handle_psi_event: PSI event callback
> +	 * @t: psi_trigger pointer
> +	 */
> +	void (*handle_psi_event)(struct psi_trigger *t);
> +

[...]

> +	/**
> +	 * @handle_cgroup_free: Cgroup free callback
> +	 * @cgroup_id: Id of freed cgroup
> +	 *
> +	 * Called every time a cgroup with an attached bpf psi trigger is freed.
> +	 * No psi events can be raised after handle_cgroup_free().
> +	 */
> +	void (*handle_cgroup_free)(u64 cgroup_id);

For the same reason mentioned in patch 1, I'd add bpf_psi_ops as an 
argument to the operations.
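
For illustration, a BPF-side user of these callbacks might look roughly like
this (a hedged sketch based on the callbacks above and the
bpf_psi_create_trigger() kfunc from patch 13, assuming the usual
vmlinux.h/libbpf includes; section names and trigger parameters are
illustrative):

	#define BPF_PSI_FULL	0x80000000

	extern int bpf_psi_create_trigger(struct bpf_psi *bpf_psi, u64 cgroup_id,
					  u32 resource, u32 threshold_us,
					  u32 window_us) __ksym;

	SEC("struct_ops.s/init")
	int BPF_PROG(psi_init, struct bpf_psi *bpf_psi)
	{
		/* system-wide (cgroup_id == 0) memory "full" trigger:
		 * 100ms of stall within a 1s window
		 */
		return bpf_psi_create_trigger(bpf_psi, 0, PSI_MEM | BPF_PSI_FULL,
					      100000, 1000000);
	}

	SEC("struct_ops.s/handle_psi_event")
	void BPF_PROG(psi_event, struct psi_trigger *t)
	{
		/* react to the pressure event, e.g. wake or arm an OOM policy */
	}

	SEC(".struct_ops.link")
	struct bpf_psi_ops psi_ops = {
		.init			= (void *)psi_init,
		.handle_psi_event	= (void *)psi_event,
	};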

> +
> +	/* private */
> +	struct bpf_psi *bpf_psi;
> +};
> +
> +struct bpf_psi {
> +	spinlock_t lock;
> +	struct list_head triggers;
> +	struct bpf_psi_ops *ops;
> +	struct srcu_struct srcu;
> +};
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +void bpf_psi_add_trigger(struct psi_trigger *t,
> +			 const struct psi_trigger_params *params);
> +void bpf_psi_remove_trigger(struct psi_trigger *t);
> +void bpf_psi_handle_event(struct psi_trigger *t);
> +#ifdef CONFIG_CGROUPS
> +void bpf_psi_cgroup_free(struct cgroup *cgroup);
> +#endif
> +
> +#else /* CONFIG_BPF_SYSCALL */
> +static inline void bpf_psi_add_trigger(struct psi_trigger *t,
> +			const struct psi_trigger_params *params) {}
> +static inline void bpf_psi_remove_trigger(struct psi_trigger *t) {}
> +static inline void bpf_psi_handle_event(struct psi_trigger *t) {}
> +static inline void bpf_psi_cgroup_free(struct cgroup *cgroup) {}
> +
> +#endif /* CONFIG_BPF_SYSCALL */
> +
> +#endif /* __BPF_PSI_H */
> diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
> index cea54121d9b9..f695cc34cfd4 100644
> --- a/include/linux/psi_types.h
> +++ b/include/linux/psi_types.h
> @@ -124,6 +124,7 @@ struct psi_window {
>   enum psi_trigger_type {
>   	PSI_SYSTEM,
>   	PSI_CGROUP,
> +	PSI_BPF,
>   };
>   
>   struct psi_trigger_params {
> @@ -145,8 +146,15 @@ struct psi_trigger_params {
>   	/* Privileged triggers are treated differently */
>   	bool privileged;
>   
> -	/* Link to kernfs open file, only for PSI_CGROUP */
> -	struct kernfs_open_file *of;
> +	union {
> +		/* Link to kernfs open file, only for PSI_CGROUP */
> +		struct kernfs_open_file *of;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +		/* Link to bpf_psi structure, only for BPF_PSI */
> +		struct bpf_psi *bpf_psi;
> +#endif
> +	};
>   };
>   
>   struct psi_trigger {
> @@ -188,6 +196,31 @@ struct psi_trigger {
>   
>   	/* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
>   	enum psi_aggregators aggregator;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +	/* Fields specific to PSI_BPF triggers */
> +
> +	/* Bpf psi structure for events handling */
> +	struct bpf_psi *bpf_psi;
> +
> +	/* List node inside bpf_psi->triggers list */
> +	struct list_head bpf_psi_node;
> +
> +	/* List node inside group->bpf_triggers list */
> +	struct list_head bpf_group_node;
> +
> +	/* Work structure, used to execute event handlers */
> +	struct work_struct bpf_work;
> +
> +	/*
> +	 * Whether the trigger is being pinned in memory.
> +	 * Protected by group->bpf_triggers_lock.
> +	 */
> +	bool pinned;
> +
> +	/* Cgroup Id */
> +	u64 cgroup_id;
> +#endif
>   };
>   
>   struct psi_group {
> @@ -236,6 +269,12 @@ struct psi_group {
>   	u64 rtpoll_total[NR_PSI_STATES - 1];
>   	u64 rtpoll_next_update;
>   	u64 rtpoll_until;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +	/* List of triggers owned by bpf and corresponding lock */
> +	spinlock_t bpf_triggers_lock;
> +	struct list_head bpf_triggers;
> +#endif
>   };
>   
>   #else /* CONFIG_PSI */
> diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
> new file mode 100644
> index 000000000000..2ea9d7276b21
> --- /dev/null
> +++ b/kernel/sched/bpf_psi.c
> @@ -0,0 +1,253 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * BPF PSI event handlers
> + *
> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> + */
> +
> +#include <linux/bpf_psi.h>
> +#include <linux/cgroup-defs.h>
> +
> +static struct workqueue_struct *bpf_psi_wq;
> +
> +static struct bpf_psi *bpf_psi_create(struct bpf_psi_ops *ops)
> +{
> +	struct bpf_psi *bpf_psi;
> +
> +	bpf_psi = kzalloc(sizeof(*bpf_psi), GFP_KERNEL);
> +	if (!bpf_psi)
> +		return NULL;
> +
> +	if (init_srcu_struct(&bpf_psi->srcu)) {
> +		kfree(bpf_psi);
> +		return NULL;
> +	}
> +
> +	spin_lock_init(&bpf_psi->lock);
> +	bpf_psi->ops = ops;
> +	INIT_LIST_HEAD(&bpf_psi->triggers);
> +	ops->bpf_psi = bpf_psi;
> +
> +	return bpf_psi;
> +}
> +
> +static void bpf_psi_free(struct bpf_psi *bpf_psi)
> +{
> +	cleanup_srcu_struct(&bpf_psi->srcu);
> +	kfree(bpf_psi);
> +}
> +
> +static void bpf_psi_handle_event_fn(struct work_struct *work)
> +{
> +	struct psi_trigger *t;
> +	struct bpf_psi *bpf_psi;
> +	int idx;
> +
> +	t = container_of(work, struct psi_trigger, bpf_work);
> +	bpf_psi = READ_ONCE(t->bpf_psi);
> +
> +	if (likely(bpf_psi)) {
> +		idx = srcu_read_lock(&bpf_psi->srcu);
> +		if (bpf_psi->ops->handle_psi_event)
> +			bpf_psi->ops->handle_psi_event(t);
> +		srcu_read_unlock(&bpf_psi->srcu, idx);
> +	}
> +}
> +
> +void bpf_psi_add_trigger(struct psi_trigger *t,
> +			 const struct psi_trigger_params *params)
> +{
> +	t->bpf_psi = params->bpf_psi;
> +	t->pinned = false;
> +	INIT_WORK(&t->bpf_work, bpf_psi_handle_event_fn);
> +
> +	spin_lock(&t->bpf_psi->lock);
> +	list_add(&t->bpf_psi_node, &t->bpf_psi->triggers);
> +	spin_unlock(&t->bpf_psi->lock);
> +
> +	spin_lock(&t->group->bpf_triggers_lock);
> +	list_add(&t->bpf_group_node, &t->group->bpf_triggers);
> +	spin_unlock(&t->group->bpf_triggers_lock);
> +}
> +
> +void bpf_psi_remove_trigger(struct psi_trigger *t)
> +{
> +	spin_lock(&t->group->bpf_triggers_lock);
> +	list_del(&t->bpf_group_node);
> +	spin_unlock(&t->group->bpf_triggers_lock);
> +
> +	spin_lock(&t->bpf_psi->lock);
> +	list_del(&t->bpf_psi_node);
> +	spin_unlock(&t->bpf_psi->lock);
> +}
> +
> +#ifdef CONFIG_CGROUPS
> +void bpf_psi_cgroup_free(struct cgroup *cgroup)
> +{
> +	struct psi_group *group = cgroup->psi;
> +	u64 cgrp_id = cgroup_id(cgroup);
> +	struct psi_trigger *t, *p;
> +	struct bpf_psi *bpf_psi;
> +	LIST_HEAD(to_destroy);
> +	int idx;
> +
> +	spin_lock(&group->bpf_triggers_lock);
> +	list_for_each_entry_safe(t, p, &group->bpf_triggers, bpf_group_node) {
> +		if (!t->pinned) {
> +			t->pinned = true;
> +			list_move(&t->bpf_group_node, &to_destroy);
> +		}
> +	}
> +	spin_unlock(&group->bpf_triggers_lock);
> +
> +	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node) {
> +		bpf_psi = READ_ONCE(t->bpf_psi);
> +
> +		idx = srcu_read_lock(&bpf_psi->srcu);
> +		if (bpf_psi->ops->handle_cgroup_free)
> +			bpf_psi->ops->handle_cgroup_free(cgrp_id);
> +		srcu_read_unlock(&bpf_psi->srcu, idx);
> +
> +		spin_lock(&bpf_psi->lock);
> +		list_del(&t->bpf_psi_node);
> +		spin_unlock(&bpf_psi->lock);
> +
> +		WRITE_ONCE(t->bpf_psi, NULL);
> +		flush_workqueue(bpf_psi_wq);
> +		synchronize_srcu(&bpf_psi->srcu);
> +		psi_trigger_destroy(t);
> +	}
> +}
> +#endif
> +
> +void bpf_psi_handle_event(struct psi_trigger *t)
> +{
> +	queue_work(bpf_psi_wq, &t->bpf_work);
> +}
> +
> +// bpf struct ops
> +
> +static int __bpf_psi_init(struct bpf_psi *bpf_psi) { return 0; }
> +static void __bpf_psi_handle_psi_event(struct psi_trigger *t) {}
> +static void __bpf_psi_handle_cgroup_free(u64 cgroup_id) {}
> +
> +static struct bpf_psi_ops __bpf_psi_ops = {
> +	.init = __bpf_psi_init,
> +	.handle_psi_event = __bpf_psi_handle_psi_event,
> +	.handle_cgroup_free = __bpf_psi_handle_cgroup_free,
> +};
> +
> +static const struct bpf_func_proto *
> +bpf_psi_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return tracing_prog_func_proto(func_id, prog);
> +}
> +
> +static bool bpf_psi_ops_is_valid_access(int off, int size,
> +					enum bpf_access_type type,
> +					const struct bpf_prog *prog,
> +					struct bpf_insn_access_aux *info)
> +{
> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
> +	.get_func_proto = bpf_psi_func_proto,
> +	.is_valid_access = bpf_psi_ops_is_valid_access,
> +};
> +
> +static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_psi_ops *ops = kdata;
> +	struct bpf_psi *bpf_psi;
> +
> +	bpf_psi = bpf_psi_create(ops);
> +	if (!bpf_psi)
> +		return -ENOMEM;
> +
> +	return ops->init(bpf_psi);
> +}
> +
> +static void bpf_psi_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_psi_ops *ops = kdata;
> +	struct bpf_psi *bpf_psi = ops->bpf_psi;
> +	struct psi_trigger *t, *p;
> +	LIST_HEAD(to_destroy);
> +
> +	spin_lock(&bpf_psi->lock);
> +	list_for_each_entry_safe(t, p, &bpf_psi->triggers, bpf_psi_node) {
> +		spin_lock(&t->group->bpf_triggers_lock);
> +		if (!t->pinned) {
> +			t->pinned = true;
> +			list_move(&t->bpf_group_node, &to_destroy);
> +			list_del(&t->bpf_psi_node);
> +
> +			WRITE_ONCE(t->bpf_psi, NULL);
> +		}
> +		spin_unlock(&t->group->bpf_triggers_lock);
> +	}
> +	spin_unlock(&bpf_psi->lock);
> +
> +	flush_workqueue(bpf_psi_wq);
> +	synchronize_srcu(&bpf_psi->srcu);
> +
> +	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node)
> +		psi_trigger_destroy(t);
> +
> +	bpf_psi_free(bpf_psi);
> +}
> +
> +static int bpf_psi_ops_check_member(const struct btf_type *t,
> +				    const struct btf_member *member,
> +				    const struct bpf_prog *prog)
> +{
> +	return 0;
> +}
> +
> +static int bpf_psi_ops_init_member(const struct btf_type *t,
> +				   const struct btf_member *member,
> +				   void *kdata, const void *udata)
> +{
> +	return 0;
> +}
> +
> +static int bpf_psi_ops_init(struct btf *btf)
> +{
> +	return 0;
> +}
> +
> +static struct bpf_struct_ops bpf_psi_bpf_ops = {
> +	.verifier_ops = &bpf_psi_verifier_ops,
> +	.reg = bpf_psi_ops_reg,
> +	.unreg = bpf_psi_ops_unreg,
> +	.check_member = bpf_psi_ops_check_member,
> +	.init_member = bpf_psi_ops_init_member,
> +	.init = bpf_psi_ops_init,
> +	.name = "bpf_psi_ops",
> +	.owner = THIS_MODULE,
> +	.cfi_stubs = &__bpf_psi_ops
> +};
> +
> +static int __init bpf_psi_struct_ops_init(void)
> +{
> +	int wq_flags = WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_HIGHPRI;
> +	int err;
> +
> +	bpf_psi_wq = alloc_workqueue("bpf_psi_wq", wq_flags, 0);
> +	if (!bpf_psi_wq)
> +		return -ENOMEM;
> +
> +	err = register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops);
> +	if (err) {
> +		pr_warn("error while registering bpf psi struct ops: %d\n", err);
> +		goto err;
> +	}
> +
> +	return 0;
> +
> +err:
> +	destroy_workqueue(bpf_psi_wq);
> +	return err;
> +}
> +late_initcall(bpf_psi_struct_ops_init);
> diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c
> index bf9d8db94b70..80f3799a2fa6 100644
> --- a/kernel/sched/build_utility.c
> +++ b/kernel/sched/build_utility.c
> @@ -19,6 +19,7 @@
>   #include <linux/sched/rseq_api.h>
>   #include <linux/sched/task_stack.h>
>   
> +#include <linux/bpf_psi.h>
>   #include <linux/cpufreq.h>
>   #include <linux/cpumask_api.h>
>   #include <linux/cpuset.h>
> @@ -92,6 +93,9 @@
>   
>   #ifdef CONFIG_PSI
>   # include "psi.c"
> +# ifdef CONFIG_BPF_SYSCALL
> +#  include "bpf_psi.c"
> +# endif
>   #endif
>   
>   #ifdef CONFIG_MEMBARRIER
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index e1d8eaeeff17..e10fbbc34099 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -201,6 +201,10 @@ static void group_init(struct psi_group *group)
>   	init_waitqueue_head(&group->rtpoll_wait);
>   	timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
>   	rcu_assign_pointer(group->rtpoll_task, NULL);
> +#ifdef CONFIG_BPF_SYSCALL
> +	spin_lock_init(&group->bpf_triggers_lock);
> +	INIT_LIST_HEAD(&group->bpf_triggers);
> +#endif
>   }
>   
>   void __init psi_init(void)
> @@ -489,10 +493,17 @@ static void update_triggers(struct psi_group *group, u64 now,
>   
>   		/* Generate an event */
>   		if (cmpxchg(&t->event, 0, 1) == 0) {
> -			if (t->type == PSI_CGROUP)
> -				kernfs_notify(t->of->kn);
> -			else
> +			switch (t->type) {
> +			case PSI_SYSTEM:
>   				wake_up_interruptible(&t->event_wait);
> +				break;
> +			case PSI_CGROUP:
> +				kernfs_notify(t->of->kn);
> +				break;
> +			case PSI_BPF:
> +				bpf_psi_handle_event(t);
> +				break;
> +			}
>   		}
>   		t->last_event_time = now;
>   		/* Reset threshold breach flag once event got generated */
> @@ -1125,6 +1136,7 @@ void psi_cgroup_free(struct cgroup *cgroup)
>   		return;
>   
>   	cancel_delayed_work_sync(&cgroup->psi->avgs_work);
> +	bpf_psi_cgroup_free(cgroup);
>   	free_percpu(cgroup->psi->pcpu);
>   	/* All triggers must be removed by now */
>   	WARN_ONCE(cgroup->psi->rtpoll_states, "psi: trigger leak\n");
> @@ -1356,6 +1368,9 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
>   	case PSI_CGROUP:
>   		t->of = params->of;
>   		break;
> +	case PSI_BPF:
> +		bpf_psi_add_trigger(t, params);
> +		break;
>   	}
>   
>   	t->pending_event = false;
> @@ -1369,8 +1384,10 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
>   
>   			task = kthread_create(psi_rtpoll_worker, group, "psimon");
>   			if (IS_ERR(task)) {
> -				kfree(t);
>   				mutex_unlock(&group->rtpoll_trigger_lock);
> +				if (t->type == PSI_BPF)
> +					bpf_psi_remove_trigger(t);
> +				kfree(t);
>   				return ERR_CAST(task);
>   			}
>   			atomic_set(&group->rtpoll_wakeup, 0);
> @@ -1414,10 +1431,16 @@ void psi_trigger_destroy(struct psi_trigger *t)
>   	 * being accessed later. Can happen if cgroup is deleted from under a
>   	 * polling process.
>   	 */
> -	if (t->type == PSI_CGROUP)
> -		kernfs_notify(t->of->kn);
> -	else
> +	switch (t->type) {
> +	case PSI_SYSTEM:
>   		wake_up_interruptible(&t->event_wait);
> +		break;
> +	case PSI_CGROUP:
> +		kernfs_notify(t->of->kn);
> +		break;
> +	case PSI_BPF:
> +		break;
> +	}
>   
>   	if (t->aggregator == PSI_AVGS) {
>   		mutex_lock(&group->avgs_lock);
> @@ -1494,10 +1517,16 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
>   	if (!t)
>   		return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
>   
> -	if (t->type == PSI_CGROUP)
> -		kernfs_generic_poll(t->of, wait);
> -	else
> +	switch (t->type) {
> +	case PSI_SYSTEM:
>   		poll_wait(file, &t->event_wait, wait);
> +		break;
> +	case PSI_CGROUP:
> +		kernfs_generic_poll(t->of, wait);
> +		break;
> +	case PSI_BPF:
> +		break;
> +	}
>   
>   	if (cmpxchg(&t->event, 1, 0) == 1)
>   		ret |= EPOLLPRI;
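
For reference, a BPF-side implementation of the bpf_psi_ops interface added
above might look roughly like the sketch below. This is illustrative only: the
section names follow the usual libbpf struct_ops conventions, and the
trigger-creation kfunc (added by a later patch in the series) is not shown.

	#include <vmlinux.h>
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	char _license[] SEC("license") = "GPL";

	/* Called once when the struct_ops is registered. */
	SEC("struct_ops/init")
	int BPF_PROG(psi_init, struct bpf_psi *bpf_psi)
	{
		return 0;
	}

	/* Invoked from the bpf_psi workqueue when a trigger fires. */
	SEC("struct_ops/handle_psi_event")
	void BPF_PROG(psi_event, struct psi_trigger *t)
	{
		bpf_printk("psi event, cgroup %llu", t->cgroup_id);
	}

	/* Invoked when a cgroup with bpf-owned triggers is freed. */
	SEC("struct_ops/handle_cgroup_free")
	void BPF_PROG(psi_cgroup_free, u64 cgroup_id)
	{
		bpf_printk("cgroup %llu is gone", cgroup_id);
	}

	SEC(".struct_ops.link")
	struct bpf_psi_ops psi_ops = {
		.init			= (void *)psi_init,
		.handle_psi_event	= (void *)psi_event,
		.handle_cgroup_free	= (void *)psi_cgroup_free,
	};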




* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-25 17:00         ` Roman Gushchin
@ 2025-08-26 18:01           ` Martin KaFai Lau
  2025-08-26 19:52             ` Alexei Starovoitov
  0 siblings, 1 reply; 67+ messages in thread
From: Martin KaFai Lau @ 2025-08-26 18:01 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Kumar Kartikeya Dwivedi, linux-mm, bpf, Suren Baghdasaryan,
	Johannes Weiner, Michal Hocko, David Rientjes, Matt Bobrowski,
	Song Liu, Alexei Starovoitov, Andrew Morton, linux-kernel

On 8/25/25 10:00 AM, Roman Gushchin wrote:
> Martin KaFai Lau <martin.lau@linux.dev> writes:
> 
>> On 8/20/25 5:24 PM, Roman Gushchin wrote:
>>>> How is it decided who gets to run before the other? Is it based on
>>>> order of attachment (which can be non-deterministic)?
>>> Yeah, now it's the order of attachment.
>>>
>>>> There was a lot of discussion on something similar for tc progs, and
>>>> we went with specific flags that capture partial ordering constraints
>>>> (instead of priorities that may collide).
>>>> https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
>>>> It would be nice if we can find a way of making this consistent.
>>
>> +1
>>
>> The cgroup bpf prog has recently added the mprog api support also. If
>> the simple order of attachment is not enough and needs to have
>> specific ordering, we should make the bpf struct_ops support the same
>> mprog api instead of asking each subsystem creating its own.
>>
>> fyi, another need for struct_ops ordering is to upgrade the
>> BPF_PROG_TYPE_SOCK_OPS api to struct_ops for easier extension in the
>> future. Slide 13 in
>> https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view
> 
> Does it mean it's better now to keep it simple in the context of oom
> patches with the plan to later reuse the generic struct_ops
> infrastructure?
> 
> Honestly, I believe that the simple order of attachment should be
> good enough for quite a while, so I'd not over-complicate this,
> unless it's not fixable later.

I think the simple attachment ordering is fine. Presumably the current linked list 
in patch 1 can be replaced by mprog in the future. Other experts can chime 
in if I have missed things.

If an ordering api turns out to be needed in the future, it should probably stay 
with mprog instead of each subsystem creating its own. An inspection tool 
(likely a subcommand in bpftool) can also be created to inspect the struct_ops order 
of a subsystem.



* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-26 18:01           ` Martin KaFai Lau
@ 2025-08-26 19:52             ` Alexei Starovoitov
  2025-08-27 18:28               ` Roman Gushchin
  2025-09-02 17:31               ` Roman Gushchin
  0 siblings, 2 replies; 67+ messages in thread
From: Alexei Starovoitov @ 2025-08-26 19:52 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Roman Gushchin, Kumar Kartikeya Dwivedi, linux-mm, bpf,
	Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Alexei Starovoitov, Andrew Morton, LKML

On Tue, Aug 26, 2025 at 11:01 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 8/25/25 10:00 AM, Roman Gushchin wrote:
> > Martin KaFai Lau <martin.lau@linux.dev> writes:
> >
> >> On 8/20/25 5:24 PM, Roman Gushchin wrote:
> >>>> How is it decided who gets to run before the other? Is it based on
> >>>> order of attachment (which can be non-deterministic)?
> >>> Yeah, now it's the order of attachment.
> >>>
> >>>> There was a lot of discussion on something similar for tc progs, and
> >>>> we went with specific flags that capture partial ordering constraints
> >>>> (instead of priorities that may collide).
> >>>> https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
> >>>> It would be nice if we can find a way of making this consistent.
> >>
> >> +1
> >>
> >> The cgroup bpf prog has recently added the mprog api support also. If
> >> the simple order of attachment is not enough and needs to have
> >> specific ordering, we should make the bpf struct_ops support the same
> >> mprog api instead of asking each subsystem creating its own.
> >>
> >> fyi, another need for struct_ops ordering is to upgrade the
> >> BPF_PROG_TYPE_SOCK_OPS api to struct_ops for easier extension in the
> >> future. Slide 13 in
> >> https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view
> >
> > Does it mean it's better now to keep it simple in the context of oom
> > patches with the plan to later reuse the generic struct_ops
> > infrastructure?
> >
> > Honestly, I believe that the simple order of attachment should be
> > good enough for quite a while, so I'd not over-complicate this,
> > unless it's not fixable later.
>
> I think the simple attachment ordering is fine. Presumably the current link list
> in patch 1 can be replaced by the mprog in the future. Other experts can chime
> in if I have missed things.

I don't think the proposed approach of:
list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
is extensible without breaking things.
Sooner or later people will want bpf-oom handlers to be per-container,
so we have to think upfront about how to do it.
I would start with one bpf-oom prog per memcg and extend with mprog later.
Effectively placing 'struct bpf_oom_ops *' into oc->memcg,
and having one global bpf_oom_ops when oc->memcg == NULL.
I'm sure other designs are possible, but let's make sure container scope
is designed from the beginning.
mprog-like multi-prog behavior per container can be added later.
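
For illustration, a minimal sketch of the lookup this suggests might be as
follows (the bpf_oom_ops field in struct mem_cgroup and the global fallback
pointer are assumptions here, not something the current series adds):

	/*
	 * Sketch only: assumes a 'bpf_oom_ops' pointer is added to struct
	 * mem_cgroup, plus a single global handler for the oc->memcg == NULL
	 * (system-wide OOM) case.
	 */
	static struct bpf_oom_ops *bpf_oom_ops_for(struct oom_control *oc)
	{
		if (oc->memcg)
			return READ_ONCE(oc->memcg->bpf_oom_ops);

		return READ_ONCE(global_bpf_oom_ops);
	}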



* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-26 19:52             ` Alexei Starovoitov
@ 2025-08-27 18:28               ` Roman Gushchin
  2025-09-02 17:31               ` Roman Gushchin
  1 sibling, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-08-27 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Martin KaFai Lau, Kumar Kartikeya Dwivedi, linux-mm, bpf,
	Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Alexei Starovoitov, Andrew Morton, LKML

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Aug 26, 2025 at 11:01 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 8/25/25 10:00 AM, Roman Gushchin wrote:
>> > Martin KaFai Lau <martin.lau@linux.dev> writes:
>> >
>> >> On 8/20/25 5:24 PM, Roman Gushchin wrote:
>> >>>> How is it decided who gets to run before the other? Is it based on
>> >>>> order of attachment (which can be non-deterministic)?
>> >>> Yeah, now it's the order of attachment.
>> >>>
>> >>>> There was a lot of discussion on something similar for tc progs, and
>> >>>> we went with specific flags that capture partial ordering constraints
>> >>>> (instead of priorities that may collide).
>> >>>> https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
>> >>>> It would be nice if we can find a way of making this consistent.
>> >>
>> >> +1
>> >>
>> >> The cgroup bpf prog has recently added the mprog api support also. If
>> >> the simple order of attachment is not enough and needs to have
>> >> specific ordering, we should make the bpf struct_ops support the same
>> >> mprog api instead of asking each subsystem creating its own.
>> >>
>> >> fyi, another need for struct_ops ordering is to upgrade the
>> >> BPF_PROG_TYPE_SOCK_OPS api to struct_ops for easier extension in the
>> >> future. Slide 13 in
>> >> https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view
>> >
>> > Does it mean it's better now to keep it simple in the context of oom
>> > patches with the plan to later reuse the generic struct_ops
>> > infrastructure?
>> >
>> > Honestly, I believe that the simple order of attachment should be
>> > good enough for quite a while, so I'd not over-complicate this,
>> > unless it's not fixable later.
>>
>> I think the simple attachment ordering is fine. Presumably the current link list
>> in patch 1 can be replaced by the mprog in the future. Other experts can chime
>> in if I have missed things.
>
> I don't think the proposed approach of:
> list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
> is extensible without breaking things.
> Sooner or later people will want bpf-oom handlers to be per
> container, so we have to think upfront how to do it.
> I would start with one bpf-oom prog per memcg and extend with mprog later.
> Effectively placing 'struct bpf_oom_ops *' into oc->memcg,
> and having one global bpf_oom_ops when oc->memcg == NULL.
> I'm sure other designs are possible, but lets make sure container scope
> is designed from the beginning.
> mprog-like multi prog behavior per container can be added later.

Sounds good to me, will implement something like this in the next version.

Thanks!



* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-08-26 19:52             ` Alexei Starovoitov
  2025-08-27 18:28               ` Roman Gushchin
@ 2025-09-02 17:31               ` Roman Gushchin
  2025-09-02 22:30                 ` Martin KaFai Lau
  2025-09-03  0:29                 ` Tejun Heo
  1 sibling, 2 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-09-02 17:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Martin KaFai Lau, Kumar Kartikeya Dwivedi, linux-mm, bpf,
	Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Alexei Starovoitov, Andrew Morton, LKML

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Tue, Aug 26, 2025 at 11:01 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 8/25/25 10:00 AM, Roman Gushchin wrote:
>> > Martin KaFai Lau <martin.lau@linux.dev> writes:
>> >
>> >> On 8/20/25 5:24 PM, Roman Gushchin wrote:
>> >>>> How is it decided who gets to run before the other? Is it based on
>> >>>> order of attachment (which can be non-deterministic)?
>> >>> Yeah, now it's the order of attachment.
>> >>>
>> >>>> There was a lot of discussion on something similar for tc progs, and
>> >>>> we went with specific flags that capture partial ordering constraints
>> >>>> (instead of priorities that may collide).
>> >>>> https://lore.kernel.org/all/20230719140858.13224-2-daniel@iogearbox.net
>> >>>> It would be nice if we can find a way of making this consistent.
>> >>
>> >> +1
>> >>
>> >> The cgroup bpf prog has recently added the mprog api support also. If
>> >> the simple order of attachment is not enough and needs to have
>> >> specific ordering, we should make the bpf struct_ops support the same
>> >> mprog api instead of asking each subsystem creating its own.
>> >>
>> >> fyi, another need for struct_ops ordering is to upgrade the
>> >> BPF_PROG_TYPE_SOCK_OPS api to struct_ops for easier extension in the
>> >> future. Slide 13 in
>> >> https://drive.google.com/file/d/1wjKZth6T0llLJ_ONPAL_6Q_jbxbAjByp/view
>> >
>> > Does it mean it's better now to keep it simple in the context of oom
>> > patches with the plan to later reuse the generic struct_ops
>> > infrastructure?
>> >
>> > Honestly, I believe that the simple order of attachment should be
>> > good enough for quite a while, so I'd not over-complicate this,
>> > unless it's not fixable later.
>>
>> I think the simple attachment ordering is fine. Presumably the current link list
>> in patch 1 can be replaced by the mprog in the future. Other experts can chime
>> in if I have missed things.
>
> I don't think the proposed approach of:
> list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
> is extensible without breaking things.
> Sooner or later people will want bpf-oom handlers to be per
> container, so we have to think upfront how to do it.
> I would start with one bpf-oom prog per memcg and extend with mprog later.
> Effectively placing 'struct bpf_oom_ops *' into oc->memcg,
> and having one global bpf_oom_ops when oc->memcg == NULL.
> I'm sure other designs are possible, but lets make sure container scope
> is designed from the beginning.
> mprog-like multi prog behavior per container can be added later.

Btw, what's the right way to attach struct ops to a cgroup, if there is
one? Add a cgroup_id field to the struct and use it in the .reg()
callback? Or is there something better?

Thanks



* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-09-02 17:31               ` Roman Gushchin
@ 2025-09-02 22:30                 ` Martin KaFai Lau
  2025-09-02 23:36                   ` Roman Gushchin
  2025-09-03  0:29                 ` Tejun Heo
  1 sibling, 1 reply; 67+ messages in thread
From: Martin KaFai Lau @ 2025-09-02 22:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Kumar Kartikeya Dwivedi, linux-mm, bpf,
	Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Alexei Starovoitov, Andrew Morton, LKML

On 9/2/25 10:31 AM, Roman Gushchin wrote:
> Btw, what's the right way to attach struct ops to a cgroup, if there is
> one? Add a cgroup_id field to the struct and use it in the .reg()

Adding a cgroup id/fd field to struct bpf_oom_ops would make it hard to attach the 
same bpf_oom_ops to multiple cgroups.

> callback? Or there is something better?

There is a link_create.target_fd in the "union bpf_attr". 
cgroup_bpf_link_attach() uses it as a cgroup fd. Maybe it can be used here 
as well. This would limit it to link attach only, meaning 
SEC(".struct_ops.link") is supported but not the older SEC(".struct_ops"). I 
think this should be fine.
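
For illustration, the struct_ops link attach path could resolve that fd along
these lines (sketch only; the helper name and where it would be called from
are assumptions, while cgroup_get_from_fd() is the existing kernel API used
by cgroup links):

	/*
	 * Sketch: interpret link_create.target_fd as an optional cgroup fd,
	 * the same way cgroup_bpf_link_attach() does.
	 */
	static struct cgroup *bpf_oom_resolve_target(const union bpf_attr *attr)
	{
		if (!attr->link_create.target_fd)
			return NULL;	/* no cgroup given: system-wide attach */

		return cgroup_get_from_fd(attr->link_create.target_fd);
	}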




* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-09-02 22:30                 ` Martin KaFai Lau
@ 2025-09-02 23:36                   ` Roman Gushchin
  0 siblings, 0 replies; 67+ messages in thread
From: Roman Gushchin @ 2025-09-02 23:36 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Alexei Starovoitov, Kumar Kartikeya Dwivedi, linux-mm, bpf,
	Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Alexei Starovoitov, Andrew Morton, LKML

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 9/2/25 10:31 AM, Roman Gushchin wrote:
>> Btw, what's the right way to attach struct ops to a cgroup, if there is
>> one? Add a cgroup_id field to the struct and use it in the .reg()
>
> Adding a cgroup id/fd field to the struct bpf_oom_ops will be hard to
> attach the same bpf_oom_ops to multiple cgroups.

Yeah, this is what I thought too; it doesn't look like an attractive path.

>
>> callback? Or there is something better?
>
> There is a link_create.target_fd in the "union bpf_attr". The
> cgroup_bpf_link_attach() is using it as cgroup fd. May be it can be
> used here also. This will limit it to link attach only. Meaning the
> SEC(".struct_ops.link") is supported but not the older
> SEC(".struct_ops"). I think this should be fine.

I'll take a look, thank you!



* Re: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
  2025-09-02 17:31               ` Roman Gushchin
  2025-09-02 22:30                 ` Martin KaFai Lau
@ 2025-09-03  0:29                 ` Tejun Heo
  1 sibling, 0 replies; 67+ messages in thread
From: Tejun Heo @ 2025-09-03  0:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Martin KaFai Lau, Kumar Kartikeya Dwivedi,
	linux-mm, bpf, Suren Baghdasaryan, Johannes Weiner, Michal Hocko,
	David Rientjes, Matt Bobrowski, Song Liu, Alexei Starovoitov,
	Andrew Morton, LKML

Hello, Roman. How are you?

On Tue, Sep 02, 2025 at 10:31:33AM -0700, Roman Gushchin wrote:
...
> Btw, what's the right way to attach struct ops to a cgroup, if there is
> one? Add a cgroup_id field to the struct and use it in the .reg()
> callback? Or there is something better?

So, I'm trying to do something similar with sched_ext. Right now, I only
have a very rough prototype (I can attach multiple schedulers with warnings
and they can even schedule for several seconds before melting down).
However, the basic pieces may still be useful. The branch is:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-hier-prototype

There are several pieces:

- cgroup recently grew lifetime notifiers that you can hook into to
  receive on/offline events. This is useful for initializing per-cgroup
  fields and cleaning up when a cgroup dies:

  https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/tree/kernel/sched/ext.c?h=scx-hier-prototype#n5469

- I'm passing in cgroup_id as an optional field in struct_ops and then, in the
  enable path, looking up the matching cgroup, verifying it can be attached
  there, and inserting/updating data structures accordingly (a rough sketch
  of this approach follows the list below):

  https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/tree/kernel/sched/ext.c?h=scx-hier-prototype#n5280

- I wanted to be able to group BPF programs together so that the related BPF
  timers, tracing progs and so on can call sched_ext kfuncs to operate on
  the associated scheduler instance. This currently isn't possible, so I'm
  using a really silly hack. I'm hoping we'd be able to get something better
  in the future:

  https://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git/commit/?h=scx-hier-prototype&id=b459b1f967fe1767783360761042cd36a1a5f2d6
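
For illustration, a rough sketch of the cgroup_id-based attachment described
in the second item above (all names below are made up for the example; they
are not taken from the scx-hier-prototype branch or from the OOM series):

	struct bpf_oom_ops {
		int (*handle_out_of_memory)(struct oom_control *oc);
		u64 cgroup_id;	/* 0 means "attach system-wide" */
	};

	static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
	{
		struct bpf_oom_ops *ops = kdata;
		struct cgroup *cgrp;
		int err;

		if (!ops->cgroup_id)
			return bpf_oom_attach_global(ops);	/* hypothetical helper */

		cgrp = cgroup_get_from_id(ops->cgroup_id);
		if (IS_ERR(cgrp))
			return PTR_ERR(cgrp);

		err = bpf_oom_attach_cgroup(ops, cgrp);		/* hypothetical helper */
		cgroup_put(cgrp);
		return err;
	}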

Thanks.

-- 
tejun



Thread overview: 67+ messages
2025-08-18 17:01 [PATCH v1 00/14] mm: BPF OOM Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling Roman Gushchin
2025-08-19  4:09   ` Suren Baghdasaryan
2025-08-19 20:06     ` Roman Gushchin
2025-08-20 19:34       ` Suren Baghdasaryan
2025-08-20 19:52         ` Roman Gushchin
2025-08-20 20:01           ` Suren Baghdasaryan
2025-08-26 16:23         ` Amery Hung
2025-08-20 11:28   ` Kumar Kartikeya Dwivedi
2025-08-21  0:24     ` Roman Gushchin
2025-08-21  0:36       ` Kumar Kartikeya Dwivedi
2025-08-21  2:22         ` Roman Gushchin
2025-08-21 15:54           ` Suren Baghdasaryan
2025-08-22 19:27       ` Martin KaFai Lau
2025-08-25 17:00         ` Roman Gushchin
2025-08-26 18:01           ` Martin KaFai Lau
2025-08-26 19:52             ` Alexei Starovoitov
2025-08-27 18:28               ` Roman Gushchin
2025-09-02 17:31               ` Roman Gushchin
2025-09-02 22:30                 ` Martin KaFai Lau
2025-09-02 23:36                   ` Roman Gushchin
2025-09-03  0:29                 ` Tejun Heo
2025-08-26 16:56   ` Amery Hung
2025-08-18 17:01 ` [PATCH v1 02/14] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
2025-08-20  9:17   ` Kumar Kartikeya Dwivedi
2025-08-20 22:32     ` Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 03/14] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers Roman Gushchin
2025-08-20  9:21   ` Kumar Kartikeya Dwivedi
2025-08-20 22:43     ` Roman Gushchin
2025-08-20 23:33       ` Kumar Kartikeya Dwivedi
2025-08-18 17:01 ` [PATCH v1 05/14] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Roman Gushchin
2025-08-20  9:25   ` Kumar Kartikeya Dwivedi
2025-08-20 22:45     ` Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 06/14] mm: introduce bpf_out_of_memory() " Roman Gushchin
2025-08-19  4:09   ` Suren Baghdasaryan
2025-08-19 20:16     ` Roman Gushchin
2025-08-20  9:34   ` Kumar Kartikeya Dwivedi
2025-08-20 22:59     ` Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 07/14] mm: allow specifying custom oom constraint for bpf triggers Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 08/14] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 09/14] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 10/14] bpf: selftests: bpf OOM handler test Roman Gushchin
2025-08-20  9:33   ` Kumar Kartikeya Dwivedi
2025-08-20 22:49     ` Roman Gushchin
2025-08-20 20:23   ` Andrii Nakryiko
2025-08-21  0:10     ` Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 11/14] sched: psi: refactor psi_trigger_create() Roman Gushchin
2025-08-19  4:09   ` Suren Baghdasaryan
2025-08-19 20:28     ` Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf Roman Gushchin
2025-08-19  4:11   ` Suren Baghdasaryan
2025-08-19 22:31     ` Roman Gushchin
2025-08-19 23:31       ` Roman Gushchin
2025-08-20 23:56         ` Suren Baghdasaryan
2025-08-26 17:03   ` Amery Hung
2025-08-18 17:01 ` [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc Roman Gushchin
2025-08-20 20:30   ` Andrii Nakryiko
2025-08-21  0:36     ` Roman Gushchin
2025-08-22 19:13       ` Andrii Nakryiko
2025-08-22 19:57       ` Martin KaFai Lau
2025-08-25 16:56         ` Roman Gushchin
2025-08-18 17:01 ` [PATCH v1 14/14] bpf: selftests: psi struct ops test Roman Gushchin
2025-08-19  4:08 ` [PATCH v1 00/14] mm: BPF OOM Suren Baghdasaryan
2025-08-19 19:52   ` Roman Gushchin
2025-08-20 21:06 ` Shakeel Butt
2025-08-21  0:01   ` Roman Gushchin
