[RFC PATCH 0/3] Memory Controller eBPF support

All of lore.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/3] Memory Controller eBPF support
@ 2025-11-19  1:34 Hui Zhu
  2025-11-19  1:34 ` [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging Hui Zhu
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Hui Zhu @ 2025-11-19  1:34 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
	Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
	Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
	Masahiro Yamada, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest
  Cc: Hui Zhu

From: Hui Zhu <zhuhui@kylinos.cn>

This series proposes adding eBPF support to the Linux memory
controller, enabling dynamic and extensible memory management
policies at runtime.

Background

The memory controller (memcg) currently provides fixed memory
accounting and reclamation policies through static kernel code.
This limits flexibility for specialized workloads and use cases
that require custom memory management strategies.

By enabling eBPF programs to hook into key memory control
operations, administrators can implement custom policies without
recompiling the kernel, while maintaining the safety guarantees
provided by the BPF verifier.

Use Cases

1. Custom memory reclamation strategies for specialized workloads
2. Dynamic memory pressure monitoring and telemetry
3. Memory accounting adjustments based on runtime conditions
4. Integration with container orchestration systems for
   intelligent resource management
5. Research and experimentation with novel memory management
   algorithms

Design Overview

This series introduces:

1. A new BPF struct ops type (`memcg_ops`) that allows eBPF
   programs to implement custom behavior for memory charging
   operations.

2. A hook point in the `try_charge_memcg()` fast path that
   invokes registered eBPF programs to determine if custom
   memory management should be applied.

3. The eBPF handler can inspect memory cgroup context and
   optionally modify certain parameters (e.g., `nr_pages` for
   reclamation size).

4. A reference counting mechanism using `percpu_ref` to safely
   manage the lifecycle of registered eBPF struct ops instances.

5. Configuration via `CONFIG_MEMCG_BPF` to allow disabling this
   feature at build time.

Implementation Details

- Uses BPF struct ops for a cleaner integration model
- Leverages static branch keys for minimal overhead when feature
  is unused
- RCU synchronization ensures safe replacement of handlers
- Sample eBPF program demonstrates monitoring capabilities
- Comprehensive selftest suite validates core functionality

Performance Considerations

- Zero overhead when feature is disabled or no eBPF program is
  loaded (static branch is disabled)
- Minimal overhead when enabled: one indirect function call per
  charge attempt
- eBPF programs run under the restrictions of the BPF verifier

Patch Overview

PATCH 1/3: Core kernel implementation
  - Adds eBPF struct ops support to memcg
  - Introduces CONFIG_MEMCG_BPF option
  - Implements safe registration/unregistration mechanism

PATCH 2/3: Selftest suite
  - prog_tests/memcg_ops.c: Test entry points
  - progs/memcg_ops.bpf.c: Test eBPF program
  - Validates load, attach, and single-handler constraints

PATCH 3/3: Sample userspace program
  - samples/bpf/memcg_printk.bpf.c: Monitoring eBPF program
  - samples/bpf/memcg_printk.c: Userspace loader
  - Demonstrates real-world usage and debugging capabilities

Open Questions & Discussion Points

1. Should the eBPF handler have access to additional memory
   cgroup state? Current design exposes minimal context to
   reduce attack surface.

2. Are there other memory control operations that would benefit
   from eBPF extensibility (e.g., uncharge, reclaim)?

3. Should there be permission checks or restrictions on who can
   load memcg eBPF programs? Currently inherits BPF's
   CAP_PERFMON/CAP_SYS_ADMIN requirements.

4. How should we handle multiple eBPF programs trying to
   register? Current implementation allows only one active
   handler.

5. Is the current exposed context in `try_charge_memcg` struct
   sufficient, or should additional fields be added?

Testing

The selftests provide comprehensive coverage of the core
functionality. The sample program can be used for manual
testing and as a reference for implementing additional
monitoring tools.

Hui Zhu (3):
  memcg: add eBPF struct ops support for memory charging
  selftests/bpf: add memcg eBPF struct ops test
  samples/bpf: add example memcg eBPF program

 MAINTAINERS                                   |   5 +
 init/Kconfig                                  |  38 ++++
 mm/Makefile                                   |   1 +
 mm/memcontrol.c                               |  26 ++-
 mm/memcontrol_bpf.c                           | 200 ++++++++++++++++++
 mm/memcontrol_bpf.h                           | 103 +++++++++
 samples/bpf/Makefile                          |   2 +
 samples/bpf/memcg_printk.bpf.c                |  30 +++
 samples/bpf/memcg_printk.c                    |  82 +++++++
 .../selftests/bpf/prog_tests/memcg_ops.c      | 117 ++++++++++
 tools/testing/selftests/bpf/progs/memcg_ops.c |  20 ++
 11 files changed, 617 insertions(+), 7 deletions(-)
 create mode 100644 mm/memcontrol_bpf.c
 create mode 100644 mm/memcontrol_bpf.h
 create mode 100644 samples/bpf/memcg_printk.bpf.c
 create mode 100644 samples/bpf/memcg_printk.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c

-- 
2.43.0

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging
  2025-11-19  1:34 [RFC PATCH 0/3] Memory Controller eBPF support Hui Zhu
@ 2025-11-19  1:34 ` Hui Zhu
  2025-11-19  2:10   ` bot+bpf-ci
                     ` (2 more replies)
  2025-11-19  1:34 ` [RFC PATCH 2/3] selftests/bpf: add memcg eBPF struct ops test Hui Zhu
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 19+ messages in thread
From: Hui Zhu @ 2025-11-19  1:34 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
	Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
	Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
	Masahiro Yamada, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest
  Cc: Hui Zhu, Geliang Tang

From: Hui Zhu <zhuhui@kylinos.cn>

Add eBPF struct ops support to the memory controller, enabling
dynamic memory management policies via eBPF programs. This
allows users to implement custom memory charging and
reclamation strategies without kernel recompilation.

The implementation introduces:

- A new BPF struct ops type `memcg_ops` with a `try_charge_memcg`
  hook for intercepting memory charge operations
- Integration into the `try_charge_memcg()` function to call
  registered eBPF handlers
- Safe registration/unregistration via BPF struct ops
  infrastructure
- Reference counting using percpu_ref to track handler lifecycle
- Static branch keys to minimize overhead when disabled
- New Kconfig option CONFIG_MEMCG_BPF to control the feature

The eBPF handler receives a `try_charge_memcg` struct containing:
- Memory cgroup and affected memory cgroup
- GFP flags and page count
- Reclamation options
- Current charge status

Handlers can inspect this context and modify certain fields
(e.g., nr_pages) to adjust reclamation behavior. The design
enforces single active handler to avoid conflicts.

Use cases include:
- Custom memory policies for specialized workloads
- Memory pressure telemetry and monitoring
- Integration with container management systems
- Runtime memory management experimentation

Design decisions:
- Uses RCU synchronization for safe handler replacement
- Zero overhead when feature is disabled (via static keys)
- Single handler model prevents complexity and race conditions
- eBPF verifier restrictions ensure memory safety
- Minimal context exposure to reduce attack surface

Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
 MAINTAINERS         |   2 +
 init/Kconfig        |  38 +++++++++
 mm/Makefile         |   1 +
 mm/memcontrol.c     |  26 ++++--
 mm/memcontrol_bpf.c | 200 ++++++++++++++++++++++++++++++++++++++++++++
 mm/memcontrol_bpf.h | 103 +++++++++++++++++++++++
 6 files changed, 363 insertions(+), 7 deletions(-)
 create mode 100644 mm/memcontrol_bpf.c
 create mode 100644 mm/memcontrol_bpf.h

diff --git a/MAINTAINERS b/MAINTAINERS
index e64b94e6b5a9..498d01c9a48e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6351,6 +6351,8 @@ F:	include/linux/page_counter.h
 F:	mm/memcontrol.c
 F:	mm/memcontrol-v1.c
 F:	mm/memcontrol-v1.h
+F:	mm/memcontrol_bpf.c
+F:	mm/memcontrol_bpf.h
 F:	mm/page_counter.c
 F:	mm/swap_cgroup.c
 F:	samples/cgroup/*
diff --git a/init/Kconfig b/init/Kconfig
index cab3ad28ca49..cde8f5cb5ffa 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1063,6 +1063,44 @@ config MEMCG_V1
 
 	  Say N if unsure.
 
+config MEMCG_BPF
+	bool "Memory controller eBPF support"
+	depends on MEMCG
+	depends on BPF_SYSCALL
+	default n
+	help
+	  This option enables eBPF support for the memory controller,
+	  allowing eBPF programs to hook into memory charging
+	  operations and implement custom memory management policies
+	  at runtime.
+
+	  With this feature enabled, administrators can load eBPF
+	  programs to monitor and adjust memory charging behavior
+	  without recompiling the kernel. This enables:
+
+	  - Custom memory reclamation strategies for specialized
+	    workloads
+	  - Dynamic memory pressure telemetry and monitoring
+	  - Memory accounting adjustments based on runtime conditions
+	  - Integration with container orchestration systems
+	  - Experimentation with novel memory management algorithms
+
+	  The eBPF handler is invoked during memory charge attempts
+	  and can inspect memory cgroup context and optionally modify
+	  parameters like reclamation size.
+
+	  When this feature is disabled or no eBPF program is loaded,
+	  there is zero performance overhead. When enabled with an
+	  active program, overhead is minimal (one indirect function
+	  call per charge attempt). The eBPF verifier ensures memory
+	  safety of loaded programs.
+
+	  Only one eBPF program can be active at a time. Loading a
+	  new program requires appropriate BPF permissions
+	  (CAP_PERFMON or CAP_SYS_ADMIN).
+
+	  Say N if unsure.
+
 config BLK_CGROUP
 	bool "IO controller"
 	depends on BLOCK
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..5ac2fa7a8a74 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -102,6 +102,7 @@ obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_MEMCG_BPF) += memcontrol_bpf.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4deda33625f4..104c9e9309f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -68,6 +68,7 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "memcontrol-v1.h"
+#include "memcontrol_bpf.h"
 
 #include <linux/uaccess.h>
 
@@ -2301,13 +2302,14 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	int nr_retries = MAX_RECLAIM_RETRIES;
 	struct mem_cgroup *mem_over_limit;
 	struct page_counter *counter;
-	unsigned long nr_reclaimed;
+	unsigned long nr_reclaime, nr_reclaimed;
 	bool passed_oom = false;
 	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
 	bool drained = false;
 	bool raised_max_event = false;
 	unsigned long pflags;
 	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
+	bool charge_done = false;
 
 retry:
 	if (consume_stock(memcg, nr_pages))
@@ -2320,20 +2322,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (!do_memsw_account() ||
 	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
 		if (page_counter_try_charge(&memcg->memory, batch, &counter))
-			goto done_restock;
-		if (do_memsw_account())
-			page_counter_uncharge(&memcg->memsw, batch);
-		mem_over_limit = mem_cgroup_from_counter(counter, memory);
+			charge_done = true;
+		else {
+			if (do_memsw_account())
+				page_counter_uncharge(&memcg->memsw, batch);
+			mem_over_limit = mem_cgroup_from_counter(counter, memory);
+		}
 	} else {
 		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
 		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
 	}
 
-	if (batch > nr_pages) {
+	if (!charge_done && batch > nr_pages) {
 		batch = nr_pages;
 		goto retry;
 	}
 
+	nr_reclaime = bpf_try_charge_memcg(memcg, gfp_mask, nr_pages,
+					   mem_over_limit,
+					   reclaim_options,
+					   charge_done);
+
+	if (charge_done)
+		goto done_restock;
+
 	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
@@ -2353,7 +2365,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	raised_max_event = true;
 
 	psi_memstall_enter(&pflags);
-	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
+	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_reclaime,
 						    gfp_mask, reclaim_options, NULL);
 	psi_memstall_leave(&pflags);
 
diff --git a/mm/memcontrol_bpf.c b/mm/memcontrol_bpf.c
new file mode 100644
index 000000000000..0bdb2a147a50
--- /dev/null
+++ b/mm/memcontrol_bpf.c
@@ -0,0 +1,200 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Memory Controller eBPF support
+ *
+ * Author: Hui Zhu <zhuhui@kylinos.cn>
+ * Copyright (C) 2025 KylinSoft Corporation.
+ */
+
+#include <linux/cgroup-defs.h>
+#include <linux/page_counter.h>
+#include <linux/memcontrol.h>
+#include <linux/cgroup.h>
+#include <linux/rcupdate.h>
+#include <linux/bpf_verifier.h>
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#include <linux/module.h>
+#include "memcontrol_bpf.h"
+
+struct memcg_ops __rcu *memcg_ops;
+DEFINE_STATIC_KEY_FALSE(memcg_bpf_enable);
+
+static void memcg_ops_release(struct percpu_ref *ref)
+{
+	struct memcg_ops *ops = container_of(ref,
+		struct memcg_ops, refcount);
+
+	/* Signal destruction completion */
+	complete(&ops->destroy_done);
+}
+
+static int memcg_ops_btf_struct_access(struct bpf_verifier_log *log,
+					const struct bpf_reg_state *reg,
+					int off, int size)
+{
+	size_t end;
+
+	switch (off) {
+	case offsetof(struct try_charge_memcg, nr_pages):
+		end = offsetofend(struct try_charge_memcg, nr_pages);
+		break;
+	default:
+		return -EACCES;
+	}
+
+	if (off + size > end)
+		return -EACCES;
+
+	return 0;
+}
+
+static bool memcg_ops_is_valid_access(int off, int size, enum bpf_access_type type,
+	const struct bpf_prog *prog,
+	struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+const struct bpf_verifier_ops bpf_memcg_verifier_ops = {
+	.get_func_proto = bpf_base_func_proto,
+	.btf_struct_access = memcg_ops_btf_struct_access,
+	.is_valid_access = memcg_ops_is_valid_access,
+};
+
+static int cfi_try_charge_memcg(struct try_charge_memcg *tcm)
+{
+	return -EINVAL;
+}
+
+static struct memcg_ops cfi_bpf_memcg_ops = {
+	.try_charge_memcg = cfi_try_charge_memcg,
+};
+
+static int bpf_memcg_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int bpf_memcg_ops_check_member(const struct btf_type *t,
+				const struct btf_member *member,
+				const struct bpf_prog *prog)
+{
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct memcg_ops, try_charge_memcg):
+	case offsetof(struct memcg_ops, refcount):
+	case offsetof(struct memcg_ops, destroy_done):
+		break;
+	default:
+		if (prog->sleepable)
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int default_try_charge_memcg(struct try_charge_memcg *tcm)
+{
+	return 0;
+}
+
+static int bpf_memcg_ops_init_member(const struct btf_type *t,
+				const struct btf_member *member,
+				void *kdata, const void *udata)
+{
+	struct memcg_ops *ops = (struct memcg_ops *)kdata;
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+	int ret;
+
+	if (moff == offsetof(struct memcg_ops, refcount)) {
+		ret = percpu_ref_init(&ops->refcount, memcg_ops_release, 0, GFP_KERNEL);
+		if (ret) {
+			pr_err("Failed to percpu_ref_init: %d\n", ret);
+			return ret;
+		}
+
+		init_completion(&ops->destroy_done);
+
+		if (!ops->try_charge_memcg)
+			ops->try_charge_memcg = default_try_charge_memcg;
+	}
+
+	return 0;
+}
+
+static int bpf_memcg_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct memcg_ops *new_ops, *old_ops;
+
+	/*
+	 * Check if ops already exists.
+	 * just get old_ops but not keep lock because
+	 * caller has locked st_map->lock.
+	 */
+	rcu_read_lock();
+	old_ops = rcu_dereference(memcg_ops);
+	rcu_read_unlock();
+	if (old_ops)
+		return -EEXIST;
+
+	new_ops = kdata;
+
+	/* Atomically set ops pointer (should be NULL at this point) */
+	old_ops = rcu_replace_pointer(memcg_ops, new_ops, true);
+	WARN_ON(old_ops);
+
+	static_branch_enable(&memcg_bpf_enable);
+
+	return 0;
+}
+
+/* Unregister the struct ops instance */
+static void bpf_memcg_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct memcg_ops *ops;
+
+	static_branch_disable(&memcg_bpf_enable);
+
+	/* Not lock same with bpf_memcg_ops_reg. */
+	ops = rcu_replace_pointer(memcg_ops, NULL, true);
+	if (ops) {
+		synchronize_rcu();
+
+		percpu_ref_kill(&ops->refcount);
+		wait_for_completion(&ops->destroy_done);
+
+		percpu_ref_exit(&ops->refcount);
+	}
+}
+
+static struct bpf_struct_ops bpf_memcg_ops = {
+	.verifier_ops = &bpf_memcg_verifier_ops,
+	.init = bpf_memcg_ops_init,
+	.check_member = bpf_memcg_ops_check_member,
+	.init_member = bpf_memcg_ops_init_member,
+	.reg = bpf_memcg_ops_reg,
+	.unreg = bpf_memcg_ops_unreg,
+	.name = "memcg_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &cfi_bpf_memcg_ops,
+};
+
+static int __init memcontrol_bpf_init(void)
+{
+	int err;
+
+	RCU_INIT_POINTER(memcg_ops, NULL);
+
+	err = register_bpf_struct_ops(&bpf_memcg_ops, memcg_ops);
+	if (err) {
+		pr_warn("error while registering bpf memcg_ops: %d\n", err);
+		return err;
+	}
+
+	pr_info("bpf memcg_ops registered successfully\n");
+	return 0;
+}
+late_initcall(memcontrol_bpf_init);
diff --git a/mm/memcontrol_bpf.h b/mm/memcontrol_bpf.h
new file mode 100644
index 000000000000..ee2815fc3d05
--- /dev/null
+++ b/mm/memcontrol_bpf.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/* memcontrol_bpf.h - Memory Controller eBPF support
+ *
+ * Author: Hui Zhu <zhuhui@kylinos.cn>
+ * Copyright (C) 2025 KylinSoft Corporation.
+ */
+
+#ifndef _LINUX_MEMCONTROL_BPF_H
+#define _LINUX_MEMCONTROL_BPF_H
+
+#ifdef CONFIG_MEMCG_BPF
+
+struct try_charge_memcg {
+	struct mem_cgroup *memcg;
+	gfp_t gfp_mask;
+	unsigned long nr_pages;
+	struct mem_cgroup *mem_over_limit;
+	unsigned int reclaim_options;
+	bool charge_done;
+};
+
+struct memcg_ops {
+	int (*try_charge_memcg)(struct try_charge_memcg *tcm);
+	struct percpu_ref refcount;
+	struct completion destroy_done;
+};
+
+extern struct memcg_ops __rcu *memcg_ops;
+DECLARE_STATIC_KEY_FALSE(memcg_bpf_enable);
+
+static inline struct memcg_ops *memcg_ops_get(void)
+{
+	struct memcg_ops *ops;
+
+	rcu_read_lock();
+	ops = rcu_dereference(memcg_ops);
+	if (likely(ops)) {
+		if (unlikely(!percpu_ref_tryget_live(&ops->refcount)))
+			ops = NULL;
+	}
+	rcu_read_unlock();
+
+	return ops;
+}
+
+static inline void memcg_ops_put(struct memcg_ops *ops)
+{
+	percpu_ref_put(&ops->refcount);
+}
+
+static inline unsigned long
+bpf_try_charge_memcg(struct mem_cgroup *memcg,
+		     gfp_t gfp_mask,
+		     unsigned int nr_pages,
+		     struct mem_cgroup *mem_over_limit,
+		     unsigned int reclaim_options,
+		     bool charge_done)
+{
+	struct memcg_ops *ops;
+	struct try_charge_memcg tcm;
+	int update_nr_pages;
+
+	if (likely(!static_branch_unlikely(&memcg_bpf_enable)))
+		goto out;
+
+	ops = memcg_ops_get();
+	if (unlikely(!ops))
+		goto out;
+
+	tcm.memcg = memcg;
+	tcm.gfp_mask = gfp_mask;
+	tcm.nr_pages = nr_pages;
+	tcm.mem_over_limit = mem_over_limit;
+	tcm.reclaim_options = reclaim_options;
+	tcm.charge_done = charge_done;
+
+	update_nr_pages = ops->try_charge_memcg(&tcm);
+
+	memcg_ops_put(ops);
+
+	if (update_nr_pages && !charge_done && tcm.nr_pages &&
+	    tcm.nr_pages <= HPAGE_PMD_NR)
+		nr_pages = tcm.nr_pages;
+
+out:
+	return nr_pages;
+}
+
+#else
+
+#define bpf_try_charge_memcg(memcg, gfp_mask, nr_pages, \
+			     mem_over_limit, reclaim_options, \
+			     charge_done) \
+			     ((void)memcg, \
+			      (void)gfp_mask, \
+			      nr_pages, \
+			      (void)mem_over_limit, \
+			      (void)reclaim_options, \
+			      (void)charge_done)
+
+#endif
+
+#endif
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging
  2025-11-19  1:34 ` [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging Hui Zhu
@ 2025-11-19  2:10   ` bot+bpf-ci
  2025-11-19 16:07   ` Tejun Heo
  2025-11-21 19:24   ` kernel test robot
  2 siblings, 0 replies; 19+ messages in thread
From: bot+bpf-ci @ 2025-11-19  2:10 UTC (permalink / raw)
  To: hui.zhu, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny, kernel, brauner,
	rdunlap, brgerst, masahiroy, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest
  Cc: zhuhui, geliang, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3813 bytes --]

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4deda3362..104c9e930 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -68,6 +68,7 @@
>  #include <net/ip.h>
>  #include "slab.h"
>  #include "memcontrol-v1.h"
> +#include "memcontrol_bpf.h"
>
>  #include <linux/uaccess.h>
>
> @@ -2301,13 +2302,14 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	int nr_retries = MAX_RECLAIM_RETRIES;
>  	struct mem_cgroup *mem_over_limit;
>  	struct page_counter *counter;
> -	unsigned long nr_reclaimed;
> +	unsigned long nr_reclaime, nr_reclaimed;
>  	bool passed_oom = false;
>  	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
>  	bool drained = false;
>  	bool raised_max_event = false;
>  	unsigned long pflags;
>  	bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
> +	bool charge_done = false;
>
>  retry:
>  	if (consume_stock(memcg, nr_pages))
> @@ -2320,20 +2322,30 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (!do_memsw_account() ||
>  	    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
>  		if (page_counter_try_charge(&memcg->memory, batch, &counter))
> -			goto done_restock;
> -		if (do_memsw_account())
> -			page_counter_uncharge(&memcg->memsw, batch);
> -		mem_over_limit = mem_cgroup_from_counter(counter, memory);
> +			charge_done = true;
> +		else {
> +			if (do_memsw_account())
> +				page_counter_uncharge(&memcg->memsw, batch);
> +			mem_over_limit = mem_cgroup_from_counter(counter, memory);
> +		}
>  	} else {
>  		mem_over_limit = mem_cgroup_from_counter(counter, memsw);
>  		reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
>  	}
>
> -	if (batch > nr_pages) {
> +	if (!charge_done && batch > nr_pages) {
>  		batch = nr_pages;
>  		goto retry;
>  	}
>
> +	nr_reclaime = bpf_try_charge_memcg(memcg, gfp_mask, nr_pages,
> +					   mem_over_limit,
> +					   reclaim_options,
> +					   charge_done);
                                           ^^^^^^^^^^^^

When charge succeeds (charge_done = true), mem_over_limit is never
initialized but gets passed to bpf_try_charge_memcg(). Looking at
memcontrol_bpf.h, the BPF verifier allows read access to all fields
in struct try_charge_memcg via bpf_tracing_btf_ctx_access() (only
write access to nr_pages is restricted by memcg_ops_btf_struct_access).

Can the BPF program read this uninitialized pointer value? This would
leak kernel stack data to the BPF program.

> +
> +	if (charge_done)
> +		goto done_restock;

[ ... ]

> diff --git a/mm/memcontrol_bpf.h b/mm/memcontrol_bpf.h
> new file mode 100644
> index 000000000..ee2815fc3
> --- /dev/null
> +++ b/mm/memcontrol_bpf.h

[ ... ]

> +static inline unsigned long
> +bpf_try_charge_memcg(struct mem_cgroup *memcg,
> +		     gfp_t gfp_mask,
> +		     unsigned int nr_pages,
> +		     struct mem_cgroup *mem_over_limit,
> +		     unsigned int reclaim_options,
> +		     bool charge_done)
> +{
> +	struct memcg_ops *ops;
> +	struct try_charge_memcg tcm;
> +	int update_nr_pages;
> +
> +	if (likely(!static_branch_unlikely(&memcg_bpf_enable)))
> +		goto out;
> +
> +	ops = memcg_ops_get();
> +	if (unlikely(!ops))
> +		goto out;
> +
> +	tcm.memcg = memcg;
> +	tcm.gfp_mask = gfp_mask;
> +	tcm.nr_pages = nr_pages;
> +	tcm.mem_over_limit = mem_over_limit;
                              ^^^^^^^^^^^^^^^
Here mem_over_limit gets copied into the tcm struct, which is then
passed to the BPF program at the line below.

> +	tcm.reclaim_options = reclaim_options;
> +	tcm.charge_done = charge_done;
> +
> +	update_nr_pages = ops->try_charge_memcg(&tcm);

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19486904708

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging
  2025-11-19  1:34 ` [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging Hui Zhu
  2025-11-19  2:10   ` bot+bpf-ci
@ 2025-11-19 16:07   ` Tejun Heo
  2025-11-21 19:24   ` kernel test robot
  2 siblings, 0 replies; 19+ messages in thread
From: Tejun Heo @ 2025-11-19 16:07 UTC (permalink / raw)
  To: Hui Zhu
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
	Nathan Chancellor, Kees Cook, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu,
	Geliang Tang

Hello, Hui.

On Wed, Nov 19, 2025 at 09:34:06AM +0800, Hui Zhu wrote:
> Handlers can inspect this context and modify certain fields
> (e.g., nr_pages) to adjust reclamation behavior. The design
> enforces single active handler to avoid conflicts.

Other folks would know a lot better how this should hook into memcg, but
given that this is a cgroup feature, this should allow hierarchical
delegation. It probably would be a good idea to aling it with the bpf OOM
handler support that Roman is working on.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging
  2025-11-19  1:34 ` [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging Hui Zhu
  2025-11-19  2:10   ` bot+bpf-ci
  2025-11-19 16:07   ` Tejun Heo
@ 2025-11-21 19:24   ` kernel test robot
  2 siblings, 0 replies; 19+ messages in thread
From: kernel test robot @ 2025-11-21 19:24 UTC (permalink / raw)
  To: Hui Zhu; +Cc: llvm, oe-kbuild-all

Hi Hui,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/net]
[also build test ERROR on bpf-next/master bpf/master linus/master v6.18-rc6 next-20251121]
[cannot apply to akpm-mm/mm-everything]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Hui-Zhu/memcg-add-eBPF-struct-ops-support-for-memory-charging/20251119-093645
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net
patch link:    https://lore.kernel.org/r/15f95166c6c516f303f3092e74c88ace5164bdf0.1763457705.git.zhuhui%40kylinos.cn
patch subject: [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging
config: hexagon-randconfig-001-20251121 (https://download.01.org/0day-ci/archive/20251122/202511220227.thU6cgqc-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 9e9fe08b16ea2c4d9867fb4974edf2a3776d6ece)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251122/202511220227.thU6cgqc-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511220227.thU6cgqc-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/memcontrol.c:2341:14: error: assigning to 'unsigned long' from incompatible type 'void'
    2341 |         nr_reclaime = bpf_try_charge_memcg(memcg, gfp_mask, nr_pages,
         |                     ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    2342 |                                            mem_over_limit,
         |                                            ~~~~~~~~~~~~~~~
    2343 |                                            reclaim_options,
         |                                            ~~~~~~~~~~~~~~~~
    2344 |                                            charge_done);
         |                                            ~~~~~~~~~~~~
   1 error generated.


vim +2341 mm/memcontrol.c

  2297	
  2298	static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
  2299				    unsigned int nr_pages)
  2300	{
  2301		unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
  2302		int nr_retries = MAX_RECLAIM_RETRIES;
  2303		struct mem_cgroup *mem_over_limit;
  2304		struct page_counter *counter;
  2305		unsigned long nr_reclaime, nr_reclaimed;
  2306		bool passed_oom = false;
  2307		unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
  2308		bool drained = false;
  2309		bool raised_max_event = false;
  2310		unsigned long pflags;
  2311		bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
  2312		bool charge_done = false;
  2313	
  2314	retry:
  2315		if (consume_stock(memcg, nr_pages))
  2316			return 0;
  2317	
  2318		if (!allow_spinning)
  2319			/* Avoid the refill and flush of the older stock */
  2320			batch = nr_pages;
  2321	
  2322		if (!do_memsw_account() ||
  2323		    page_counter_try_charge(&memcg->memsw, batch, &counter)) {
  2324			if (page_counter_try_charge(&memcg->memory, batch, &counter))
  2325				charge_done = true;
  2326			else {
  2327				if (do_memsw_account())
  2328					page_counter_uncharge(&memcg->memsw, batch);
  2329				mem_over_limit = mem_cgroup_from_counter(counter, memory);
  2330			}
  2331		} else {
  2332			mem_over_limit = mem_cgroup_from_counter(counter, memsw);
  2333			reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
  2334		}
  2335	
  2336		if (!charge_done && batch > nr_pages) {
  2337			batch = nr_pages;
  2338			goto retry;
  2339		}
  2340	
> 2341		nr_reclaime = bpf_try_charge_memcg(memcg, gfp_mask, nr_pages,
  2342						   mem_over_limit,
  2343						   reclaim_options,
  2344						   charge_done);
  2345	
  2346		if (charge_done)
  2347			goto done_restock;
  2348	
  2349		/*
  2350		 * Prevent unbounded recursion when reclaim operations need to
  2351		 * allocate memory. This might exceed the limits temporarily,
  2352		 * but we prefer facilitating memory reclaim and getting back
  2353		 * under the limit over triggering OOM kills in these cases.
  2354		 */
  2355		if (unlikely(current->flags & PF_MEMALLOC))
  2356			goto force;
  2357	
  2358		if (unlikely(task_in_memcg_oom(current)))
  2359			goto nomem;
  2360	
  2361		if (!gfpflags_allow_blocking(gfp_mask))
  2362			goto nomem;
  2363	
  2364		__memcg_memory_event(mem_over_limit, MEMCG_MAX, allow_spinning);
  2365		raised_max_event = true;
  2366	
  2367		psi_memstall_enter(&pflags);
  2368		nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_reclaime,
  2369							    gfp_mask, reclaim_options, NULL);
  2370		psi_memstall_leave(&pflags);
  2371	
  2372		if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
  2373			goto retry;
  2374	
  2375		if (!drained) {
  2376			drain_all_stock(mem_over_limit);
  2377			drained = true;
  2378			goto retry;
  2379		}
  2380	
  2381		if (gfp_mask & __GFP_NORETRY)
  2382			goto nomem;
  2383		/*
  2384		 * Even though the limit is exceeded at this point, reclaim
  2385		 * may have been able to free some pages.  Retry the charge
  2386		 * before killing the task.
  2387		 *
  2388		 * Only for regular pages, though: huge pages are rather
  2389		 * unlikely to succeed so close to the limit, and we fall back
  2390		 * to regular pages anyway in case of failure.
  2391		 */
  2392		if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
  2393			goto retry;
  2394	
  2395		if (nr_retries--)
  2396			goto retry;
  2397	
  2398		if (gfp_mask & __GFP_RETRY_MAYFAIL)
  2399			goto nomem;
  2400	
  2401		/* Avoid endless loop for tasks bypassed by the oom killer */
  2402		if (passed_oom && task_is_dying())
  2403			goto nomem;
  2404	
  2405		/*
  2406		 * keep retrying as long as the memcg oom killer is able to make
  2407		 * a forward progress or bypass the charge if the oom killer
  2408		 * couldn't make any progress.
  2409		 */
  2410		if (mem_cgroup_oom(mem_over_limit, gfp_mask,
  2411				   get_order(nr_pages * PAGE_SIZE))) {
  2412			passed_oom = true;
  2413			nr_retries = MAX_RECLAIM_RETRIES;
  2414			goto retry;
  2415		}
  2416	nomem:
  2417		/*
  2418		 * Memcg doesn't have a dedicated reserve for atomic
  2419		 * allocations. But like the global atomic pool, we need to
  2420		 * put the burden of reclaim on regular allocation requests
  2421		 * and let these go through as privileged allocations.
  2422		 */
  2423		if (!(gfp_mask & (__GFP_NOFAIL | __GFP_HIGH)))
  2424			return -ENOMEM;
  2425	force:
  2426		/*
  2427		 * If the allocation has to be enforced, don't forget to raise
  2428		 * a MEMCG_MAX event.
  2429		 */
  2430		if (!raised_max_event)
  2431			__memcg_memory_event(mem_over_limit, MEMCG_MAX, allow_spinning);
  2432	
  2433		/*
  2434		 * The allocation either can't fail or will lead to more memory
  2435		 * being freed very soon.  Allow memory usage go over the limit
  2436		 * temporarily by force charging it.
  2437		 */
  2438		page_counter_charge(&memcg->memory, nr_pages);
  2439		if (do_memsw_account())
  2440			page_counter_charge(&memcg->memsw, nr_pages);
  2441	
  2442		return 0;
  2443	
  2444	done_restock:
  2445		if (batch > nr_pages)
  2446			refill_stock(memcg, batch - nr_pages);
  2447	
  2448		/*
  2449		 * If the hierarchy is above the normal consumption range, schedule
  2450		 * reclaim on returning to userland.  We can perform reclaim here
  2451		 * if __GFP_RECLAIM but let's always punt for simplicity and so that
  2452		 * GFP_KERNEL can consistently be used during reclaim.  @memcg is
  2453		 * not recorded as it most likely matches current's and won't
  2454		 * change in the meantime.  As high limit is checked again before
  2455		 * reclaim, the cost of mismatch is negligible.
  2456		 */
  2457		do {
  2458			bool mem_high, swap_high;
  2459	
  2460			mem_high = page_counter_read(&memcg->memory) >
  2461				READ_ONCE(memcg->memory.high);
  2462			swap_high = page_counter_read(&memcg->swap) >
  2463				READ_ONCE(memcg->swap.high);
  2464	
  2465			/* Don't bother a random interrupted task */
  2466			if (!in_task()) {
  2467				if (mem_high) {
  2468					schedule_work(&memcg->high_work);
  2469					break;
  2470				}
  2471				continue;
  2472			}
  2473	
  2474			if (mem_high || swap_high) {
  2475				/*
  2476				 * The allocating tasks in this cgroup will need to do
  2477				 * reclaim or be throttled to prevent further growth
  2478				 * of the memory or swap footprints.
  2479				 *
  2480				 * Target some best-effort fairness between the tasks,
  2481				 * and distribute reclaim work and delay penalties
  2482				 * based on how much each task is actually allocating.
  2483				 */
  2484				current->memcg_nr_pages_over_high += batch;
  2485				set_notify_resume(current);
  2486				break;
  2487			}
  2488		} while ((memcg = parent_mem_cgroup(memcg)));
  2489	
  2490		/*
  2491		 * Reclaim is set up above to be called from the userland
  2492		 * return path. But also attempt synchronous reclaim to avoid
  2493		 * excessive overrun while the task is still inside the
  2494		 * kernel. If this is successful, the return path will see it
  2495		 * when it rechecks the overage and simply bail out.
  2496		 */
  2497		if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
  2498		    !(current->flags & PF_MEMALLOC) &&
  2499		    gfpflags_allow_blocking(gfp_mask))
  2500			__mem_cgroup_handle_over_high(gfp_mask);
  2501		return 0;
  2502	}
  2503	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 2/3] selftests/bpf: add memcg eBPF struct ops test
  2025-11-19  1:34 [RFC PATCH 0/3] Memory Controller eBPF support Hui Zhu
  2025-11-19  1:34 ` [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging Hui Zhu
@ 2025-11-19  1:34 ` Hui Zhu
  2025-11-19  2:19   ` bot+bpf-ci
  2025-11-19  1:34 ` [RFC PATCH 3/3] samples/bpf: add example memcg eBPF program Hui Zhu
  2025-11-20  3:04 ` [RFC PATCH 0/3] Memory Controller eBPF support Roman Gushchin
  3 siblings, 1 reply; 19+ messages in thread
From: Hui Zhu @ 2025-11-19  1:34 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
	Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
	Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
	Masahiro Yamada, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest
  Cc: Hui Zhu, Geliang Tang

From: Hui Zhu <zhuhui@kylinos.cn>

Add comprehensive selftest suite for memory controller eBPF
support. The tests validate the core functionality of the
memcg_ops struct ops implementation.

Test coverage includes:

1. test_memcg_ops_load: Validates that the eBPF object file
   can be successfully loaded by libbpf.

2. test_memcg_ops_attach: Tests attaching the memcg_ops struct
   ops to the kernel, verifying the basic attachment mechanism
   works correctly.

3. test_memcg_ops_double_attach: Validates that only one
   memcg_ops instance can be attached at a time. Attempts to
   attach a second program should fail, ensuring the
   single-handler design constraint is enforced.

The test suite includes:

- prog_tests/memcg_ops.c: Test entry point with individual
  test functions using standard BPF test framework helpers
  like ASSERT_OK_PTR and ASSERT_ERR_PTR
- progs/memcg_ops.bpf.c: Simple eBPF program implementing
  the struct ops interface

Uses standard test_progs framework macros for consistent error
reporting and handling. All tests properly clean up resources
in error paths.

Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
 MAINTAINERS                                   |   1 +
 .../selftests/bpf/prog_tests/memcg_ops.c      | 117 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/memcg_ops.c |  20 +++
 3 files changed, 138 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 498d01c9a48e..dc3aa53d2346 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6356,6 +6356,7 @@ F:	mm/memcontrol_bpf.h
 F:	mm/page_counter.c
 F:	mm/swap_cgroup.c
 F:	samples/cgroup/*
+F:	tools/testing/selftests/bpf/*/memcg_ops.c
 F:	tools/testing/selftests/cgroup/memcg_protection.m
 F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c
 F:	tools/testing/selftests/cgroup/test_kmem.c
diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
new file mode 100644
index 000000000000..3f989bcfb8c4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
@@ -0,0 +1,117 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Memory controller eBPF struct ops test
+ */
+
+#include <test_progs.h>
+#include <bpf/btf.h>
+
+void test_memcg_ops_load(void)
+{
+	struct bpf_object *obj;
+	int err;
+
+	obj = bpf_object__open_file("memcg_ops.bpf.o", NULL);
+	err = libbpf_get_error(obj);
+	if (CHECK_FAIL(err)) {
+		obj = NULL;
+		goto out;
+	}
+
+	err = bpf_object__load(obj);
+	if (CHECK_FAIL(err))
+		goto out;
+
+out:
+	if (obj)
+		bpf_object__close(obj);
+}
+
+void test_memcg_ops_attach(void)
+{
+	struct bpf_object *obj;
+	struct bpf_map *map;
+	struct bpf_link *link = NULL;
+	int err;
+
+	obj = bpf_object__open_file("memcg_ops.bpf.o", NULL);
+	err = libbpf_get_error(obj);
+	if (CHECK_FAIL(err)) {
+		obj = NULL;
+		goto out;
+	}
+
+	err = bpf_object__load(obj);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	map = bpf_object__find_map_by_name(obj, "mcg_ops");
+	if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name"))
+		goto out;
+
+	link = bpf_map__attach_struct_ops(map);
+	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops"))
+		goto out;
+
+out:
+	if (link)
+		bpf_link__destroy(link);
+	if (obj)
+		bpf_object__close(obj);
+}
+
+void test_memcg_ops_double_attach(void)
+{
+	struct bpf_object *obj, *obj2;
+	struct bpf_map *map, *map2;
+	struct bpf_link *link = NULL, *link2 = NULL;
+	int err;
+
+	obj = bpf_object__open_file("memcg_ops.bpf.o", NULL);
+	err = libbpf_get_error(obj);
+	if (CHECK_FAIL(err)) {
+		obj = NULL;
+		goto out;
+	}
+
+	err = bpf_object__load(obj);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	map = bpf_object__find_map_by_name(obj, "mcg_ops");
+	if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name"))
+		goto out;
+
+	link = bpf_map__attach_struct_ops(map);
+	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops"))
+		goto out;
+
+	obj2 = bpf_object__open_file("memcg_ops.bpf.o", NULL);
+	err = libbpf_get_error(obj2);
+	if (CHECK_FAIL(err)) {
+		obj2 = NULL;
+		goto out;
+	}
+
+	err = bpf_object__load(obj2);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	map2 = bpf_object__find_map_by_name(obj2, "mcg_ops");
+	if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name"))
+		goto out;
+
+	link2 = bpf_map__attach_struct_ops(map2);
+	if (!ASSERT_ERR_PTR(link2, "bpf_map__attach_struct_ops")) {
+		bpf_link__destroy(link2);
+		goto out;
+	}
+
+out:
+	if (link)
+		bpf_link__destroy(link);
+	if (obj)
+		bpf_object__close(obj);
+	if (obj2)
+		bpf_object__close(obj2);
+}
diff --git a/tools/testing/selftests/bpf/progs/memcg_ops.c b/tools/testing/selftests/bpf/progs/memcg_ops.c
new file mode 100644
index 000000000000..a21fbe859fd3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/memcg_ops.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/try_charge_memcg")
+int BPF_PROG(test_try_charge_memcg,
+	     struct try_charge_memcg *tcm)
+{
+	return 0;
+}
+
+SEC(".struct_ops")
+struct memcg_ops mcg_ops = {
+	.try_charge_memcg = (void *)test_try_charge_memcg,
+};
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 2/3] selftests/bpf: add memcg eBPF struct ops test
  2025-11-19  1:34 ` [RFC PATCH 2/3] selftests/bpf: add memcg eBPF struct ops test Hui Zhu
@ 2025-11-19  2:19   ` bot+bpf-ci
  0 siblings, 0 replies; 19+ messages in thread
From: bot+bpf-ci @ 2025-11-19  2:19 UTC (permalink / raw)
  To: hui.zhu, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny, kernel, brauner,
	rdunlap, brgerst, masahiroy, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest
  Cc: zhuhui, geliang, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2294 bytes --]

> diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
> new file mode 100644
> index 000000000..3f989bcfb
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c

[ ... ]

> +void test_memcg_ops_double_attach(void)
> +{
> +	struct bpf_object *obj, *obj2;
> +	struct bpf_map *map, *map2;
> +	struct bpf_link *link = NULL, *link2 = NULL;
> +	int err;
> +
> +	obj = bpf_object__open_file("memcg_ops.bpf.o", NULL);
> +	err = libbpf_get_error(obj);
> +	if (CHECK_FAIL(err)) {
> +		obj = NULL;
> +		goto out;
> +	}
> +
> +	err = bpf_object__load(obj);
> +	if (CHECK_FAIL(err))
> +		goto out;
> +
> +	map = bpf_object__find_map_by_name(obj, "mcg_ops");
> +	if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name"))
> +		goto out;
> +
> +	link = bpf_map__attach_struct_ops(map);
> +	if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops"))
> +		goto out;
> +
> +	obj2 = bpf_object__open_file("memcg_ops.bpf.o", NULL);
> +	err = libbpf_get_error(obj2);
> +	if (CHECK_FAIL(err)) {
> +		obj2 = NULL;
> +		goto out;
> +	}
> +
> +	err = bpf_object__load(obj2);
> +	if (CHECK_FAIL(err))
> +		goto out;
> +
> +	map2 = bpf_object__find_map_by_name(obj2, "mcg_ops");
> +	if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name"))
                         ^^^

In test_memcg_ops_double_attach(), the NULL check is validating the wrong
variable. The code retrieves map2 from obj2, but then checks map (from the
first object) instead of map2. This means if bpf_object__find_map_by_name()
returns NULL for map2, the test won't catch it and will proceed to use the
potentially NULL map2 pointer in the subsequent attach call.

Should this be checking map2 instead?

> +		goto out;
> +
> +	link2 = bpf_map__attach_struct_ops(map2);
> +	if (!ASSERT_ERR_PTR(link2, "bpf_map__attach_struct_ops")) {
> +		bpf_link__destroy(link2);
> +		goto out;
> +	}
> +
> +out:
> +	if (link)
> +		bpf_link__destroy(link);
> +	if (obj)
> +		bpf_object__close(obj);
> +	if (obj2)
> +		bpf_object__close(obj2);
> +}


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19486904708

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 3/3] samples/bpf: add example memcg eBPF program
  2025-11-19  1:34 [RFC PATCH 0/3] Memory Controller eBPF support Hui Zhu
  2025-11-19  1:34 ` [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging Hui Zhu
  2025-11-19  1:34 ` [RFC PATCH 2/3] selftests/bpf: add memcg eBPF struct ops test Hui Zhu
@ 2025-11-19  1:34 ` Hui Zhu
  2025-11-19  2:19   ` bot+bpf-ci
  2025-11-20  3:04 ` [RFC PATCH 0/3] Memory Controller eBPF support Roman Gushchin
  3 siblings, 1 reply; 19+ messages in thread
From: Hui Zhu @ 2025-11-19  1:34 UTC (permalink / raw)
  To: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
	Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
	Jan Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
	Masahiro Yamada, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest
  Cc: Hui Zhu, Geliang Tang

From: Hui Zhu <zhuhui@kylinos.cn>

Add a sample eBPF program demonstrating the new memory
controller eBPF support. This example serves as both a reference
implementation and a validation tool for the memcg eBPF
functionality.

The sample includes:

- memcg_printk.bpf.c: An eBPF program that attaches to the
  try_charge_memcg hook and prints detailed information about
  memory charging events, including:
  * Memory cgroup name
  * GFP flags and page count
  * Reclamation options
  * Affected memory cgroup (when applicable)

- memcg_printk.c: A userspace loader program that:
  * Loads the eBPF object file
  * Finds and attaches the memcg_ops struct ops
  * Keeps the program attached until interrupted
  * Provides proper error handling and cleanup

Usage:
  $ ./samples/bpf/memcg_printk

This will attach the eBPF program to the memcg charging path.
Output can be viewed via kernel trace events (e.g.,
trace_printk logs).

The program demonstrates:
- Accessing memory cgroup context fields
- Using bpf_printk for debugging and monitoring
- Proper struct ops registration via libbpf
- Integration with the kernel's BPF infrastructure

Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
 MAINTAINERS                    |  2 +
 samples/bpf/Makefile           |  2 +
 samples/bpf/memcg_printk.bpf.c | 30 +++++++++++++
 samples/bpf/memcg_printk.c     | 82 ++++++++++++++++++++++++++++++++++
 4 files changed, 116 insertions(+)
 create mode 100644 samples/bpf/memcg_printk.bpf.c
 create mode 100644 samples/bpf/memcg_printk.c

diff --git a/MAINTAINERS b/MAINTAINERS
index dc3aa53d2346..c8f32f7dad3f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6356,6 +6356,8 @@ F:	mm/memcontrol_bpf.h
 F:	mm/page_counter.c
 F:	mm/swap_cgroup.c
 F:	samples/cgroup/*
+F:	samples/memcg_printk.bpf.c
+F:	samples/memcg_printk.c
 F:	tools/testing/selftests/bpf/*/memcg_ops.c
 F:	tools/testing/selftests/cgroup/memcg_protection.m
 F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95a4fa1f1e44..d50e958fc8d5 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -37,6 +37,7 @@ tprogs-y += xdp_fwd
 tprogs-y += task_fd_query
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += memcg_printk
 
 # Libbpf dependencies
 LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -122,6 +123,7 @@ always-y += task_fd_query_kern.o
 always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
+always-y += memcg_printk.bpf.o
 
 COMMON_CFLAGS = $(TPROGS_USER_CFLAGS)
 TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
diff --git a/samples/bpf/memcg_printk.bpf.c b/samples/bpf/memcg_printk.bpf.c
new file mode 100644
index 000000000000..66c87bf4cbcb
--- /dev/null
+++ b/samples/bpf/memcg_printk.bpf.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+SEC("struct_ops/try_charge_memcg")
+int BPF_PROG(handle_try_charge_memcg, struct try_charge_memcg *tcm)
+{
+	bpf_printk(
+		"memcg %s gfp_mask 0x%x nr_pages %lu reclaim_options 0x%lx\n",
+		tcm->memcg->css.cgroup->kn->name,
+		tcm->gfp_mask,
+		tcm->nr_pages,
+		tcm->reclaim_options);
+	if (!tcm->charge_done)
+		bpf_printk("memcg %s mem_over_limit %s\n",
+			   tcm->memcg->css.cgroup->kn->name,
+			   tcm->mem_over_limit->css.cgroup->kn->name);
+
+	return 0;
+}
+
+SEC(".struct_ops")
+struct memcg_ops mcg_ops = {
+	.try_charge_memcg = (void *)handle_try_charge_memcg,
+};
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/memcg_printk.c b/samples/bpf/memcg_printk.c
new file mode 100644
index 000000000000..a2c5be2415ea
--- /dev/null
+++ b/samples/bpf/memcg_printk.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <signal.h>
+#include <bpf/libbpf.h>
+
+static bool exiting;
+
+static void sig_handler(int sig)
+{
+	exiting = true;
+}
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+	return vfprintf(stderr, format, args);
+}
+
+int main(int argc, char **argv)
+{
+	struct bpf_object *obj = NULL;
+	struct bpf_link *link = NULL;
+	struct bpf_map *map;
+	char filename[256];
+	int err;
+
+	exiting = false;
+
+	signal(SIGINT, sig_handler);
+	signal(SIGTERM, sig_handler);
+
+	libbpf_set_print(libbpf_print_fn);
+
+	snprintf(filename, sizeof(filename), "%s.bpf.o", argv[0]);
+	obj = bpf_object__open_file(filename, NULL);
+	err = libbpf_get_error(obj);
+	if (err) {
+		fprintf(stderr, "Failed to open BPF object file: %d\n",
+			err);
+		obj = NULL;
+		goto cleanup;
+	}
+
+	err = bpf_object__load(obj);
+	if (err) {
+		fprintf(stderr, "Failed to loading BPF object file: %d\n",
+			err);
+		goto cleanup;
+	}
+
+	map = bpf_object__find_map_by_name(obj, "mcg_ops");
+	if (!map) {
+		fprintf(stderr, "Failed to find struct_ops map 'mcg_ops'\n");
+		err = -ENOENT;
+		goto cleanup;
+	}
+
+	link = bpf_map__attach_struct_ops(map);
+	err = libbpf_get_error(link);
+	if (err) {
+		fprintf(stderr, "Failed to attach struct ops: %d\n",
+			err);
+		link = NULL;
+		goto cleanup;
+	}
+
+	printf("Press Ctrl+C to exit...\n");
+
+	while (!exiting)
+		sleep(1);
+
+	printf("Bye!\n");
+
+cleanup:
+	if (link)
+		bpf_link__destroy(link);
+	if (obj)
+		bpf_object__close(obj);
+
+	return err;
+}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 3/3] samples/bpf: add example memcg eBPF program
  2025-11-19  1:34 ` [RFC PATCH 3/3] samples/bpf: add example memcg eBPF program Hui Zhu
@ 2025-11-19  2:19   ` bot+bpf-ci
  0 siblings, 0 replies; 19+ messages in thread
From: bot+bpf-ci @ 2025-11-19  2:19 UTC (permalink / raw)
  To: hui.zhu, akpm, hannes, mhocko, roman.gushchin, shakeel.butt,
	muchun.song, ast, daniel, andrii, martin.lau, eddyz87, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny, kernel, brauner,
	rdunlap, brgerst, masahiroy, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest
  Cc: zhuhui, geliang, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 982 bytes --]

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 78b414902..3a70521cf 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6357,6 +6357,8 @@ F:	mm/memcontrol_bpf.h
>  F:	mm/page_counter.c
>  F:	mm/swap_cgroup.c
>  F:	samples/cgroup/*
> +F:	samples/memcg_printk.bpf.c
> +F:	samples/memcg_printk.c
                     ^^^^

The paths here are missing the bpf/ directory component. The actual
files are located at samples/bpf/memcg_printk.bpf.c and
samples/bpf/memcg_printk.c. Should these paths include the bpf/
directory to match where the files are actually created?

>  F:	tools/testing/selftests/bpf/*/memcg_ops.c
>  F:	tools/testing/selftests/cgroup/memcg_protection.m
>  F:	tools/testing/selftests/cgroup/test_hugetlb_memcg.c

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19486904708

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-19  1:34 [RFC PATCH 0/3] Memory Controller eBPF support Hui Zhu
                   ` (2 preceding siblings ...)
  2025-11-19  1:34 ` [RFC PATCH 3/3] samples/bpf: add example memcg eBPF program Hui Zhu
@ 2025-11-20  3:04 ` Roman Gushchin
  2025-11-20  9:29   ` hui.zhu
  3 siblings, 1 reply; 19+ messages in thread
From: Roman Gushchin @ 2025-11-20  3:04 UTC (permalink / raw)
  To: Hui Zhu
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Peter Zijlstra, Miguel Ojeda, Nathan Chancellor,
	Kees Cook, Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu

Hui Zhu <hui.zhu@linux.dev> writes:

> From: Hui Zhu <zhuhui@kylinos.cn>
>
> This series proposes adding eBPF support to the Linux memory
> controller, enabling dynamic and extensible memory management
> policies at runtime.
>
> Background
>
> The memory controller (memcg) currently provides fixed memory
> accounting and reclamation policies through static kernel code.
> This limits flexibility for specialized workloads and use cases
> that require custom memory management strategies.
>
> By enabling eBPF programs to hook into key memory control
> operations, administrators can implement custom policies without
> recompiling the kernel, while maintaining the safety guarantees
> provided by the BPF verifier.
>
> Use Cases
>
> 1. Custom memory reclamation strategies for specialized workloads
> 2. Dynamic memory pressure monitoring and telemetry
> 3. Memory accounting adjustments based on runtime conditions
> 4. Integration with container orchestration systems for
>    intelligent resource management
> 5. Research and experimentation with novel memory management
>    algorithms
>
> Design Overview
>
> This series introduces:
>
> 1. A new BPF struct ops type (`memcg_ops`) that allows eBPF
>    programs to implement custom behavior for memory charging
>    operations.
>
> 2. A hook point in the `try_charge_memcg()` fast path that
>    invokes registered eBPF programs to determine if custom
>    memory management should be applied.
>
> 3. The eBPF handler can inspect memory cgroup context and
>    optionally modify certain parameters (e.g., `nr_pages` for
>    reclamation size).
>
> 4. A reference counting mechanism using `percpu_ref` to safely
>    manage the lifecycle of registered eBPF struct ops instances.

Can you please describe how these hooks will be used in practice?
What's the problem you can solve with it and can't without?

I generally agree with an idea to use BPF for various memcg-related
policies, but I'm not sure how specific callbacks can be used in
practice.

Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-20  3:04 ` [RFC PATCH 0/3] Memory Controller eBPF support Roman Gushchin
@ 2025-11-20  9:29   ` hui.zhu
  2025-11-20 19:20     ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: hui.zhu @ 2025-11-20  9:29 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Alexei  Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John  Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri  Olsa, Shuah Khan, Peter Zijlstra, Miguel Ojeda,
	Nathan  Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
	Jan  Hendrik Farr, Christian Brauner, Randy Dunlap, Brian Gerst,
	Masahiro Yamada, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest, Hui Zhu

2025年11月20日 11:04, "Roman Gushchin" <roman.gushchin@linux.dev mailto:roman.gushchin@linux.dev?to=%22Roman%20Gushchin%22%20%3Croman.gushchin%40linux.dev%3E > 写到:

> 
> Hui Zhu <hui.zhu@linux.dev> writes:
> 
> > 
> > From: Hui Zhu <zhuhui@kylinos.cn>
> > 
> >  This series proposes adding eBPF support to the Linux memory
> >  controller, enabling dynamic and extensible memory management
> >  policies at runtime.
> > 
> >  Background
> > 
> >  The memory controller (memcg) currently provides fixed memory
> >  accounting and reclamation policies through static kernel code.
> >  This limits flexibility for specialized workloads and use cases
> >  that require custom memory management strategies.
> > 
> >  By enabling eBPF programs to hook into key memory control
> >  operations, administrators can implement custom policies without
> >  recompiling the kernel, while maintaining the safety guarantees
> >  provided by the BPF verifier.
> > 
> >  Use Cases
> > 
> >  1. Custom memory reclamation strategies for specialized workloads
> >  2. Dynamic memory pressure monitoring and telemetry
> >  3. Memory accounting adjustments based on runtime conditions
> >  4. Integration with container orchestration systems for
> >  intelligent resource management
> >  5. Research and experimentation with novel memory management
> >  algorithms
> > 
> >  Design Overview
> > 
> >  This series introduces:
> > 
> >  1. A new BPF struct ops type (`memcg_ops`) that allows eBPF
> >  programs to implement custom behavior for memory charging
> >  operations.
> > 
> >  2. A hook point in the `try_charge_memcg()` fast path that
> >  invokes registered eBPF programs to determine if custom
> >  memory management should be applied.
> > 
> >  3. The eBPF handler can inspect memory cgroup context and
> >  optionally modify certain parameters (e.g., `nr_pages` for
> >  reclamation size).
> > 
> >  4. A reference counting mechanism using `percpu_ref` to safely
> >  manage the lifecycle of registered eBPF struct ops instances.
> > 
> Can you please describe how these hooks will be used in practice?
> What's the problem you can solve with it and can't without?
> 
> I generally agree with an idea to use BPF for various memcg-related
> policies, but I'm not sure how specific callbacks can be used in
> practice.

Hi Roman,

Following are some ideas that can use ebpf memcg:

Priority‑Based Reclaim and Limits in Multi‑Tenant Environments:
On a single machine with multiple tenants / namespaces / containers,
under memory pressure it’s hard to decide “who should be squeezed first”
with static policies baked into the kernel.
Assign a BPF profile to each tenant’s memcg:
Under high global pressure, BPF can decide:
Which memcgs’ memory.high should be raised (delaying reclaim),
Which memcgs should be scanned and reclaimed more aggressively.

Online Profiling / Diagnosing Memory Hotspots:
A cgroup’s memory keeps growing, but without patching the kernel it’s
difficult to obtain fine‑grained information.
Attach BPF to the memcg charge/uncharge path:
Record large allocations (greater than N KB) with call stacks and
owning file/module, and send them to user space via a BPF ring buffer.
Based on sampled data, generate:
“Top N memory allocation stacks in this container over the last 10 minutes,”
Reports of which objects / call paths are growing fastest.
This makes it possible to pinpoint the root cause of host memory
anomalies without changing application code, which is very useful
in operations/ops scenarios.

SLO‑Driven Auto Throttling / Scale‑In/Out Signals:
Use eBPF to observe memory usage slope, frequent reclaim,
or near‑OOM behavior within a memcg.
When it decides “OOM is imminent,” instead of just killing/raising
limits, it can emit a signal to a control‑plane component.
For example, send an event to a user‑space agent to trigger
automatic scaling, QPS adjustment, or throttling.

Prevent a cgroup from launching a large‑scale fork+malloc attack:
BPF checks per‑uid or per‑cgroup allocation behavior over the
last few seconds during memcg charge.

And I maintain a software project, https://github.com/teawater/mem-agent,
for specialized memory management and related functions.
However, I found that implementing certain memory QoS categories
for memcg solely from user space is rather inefficient,
as it requires frequent access to values within memcg.
This is why I want memcg to support eBPF—so that I can place
custom memory management logic directly into the kernel using eBPF,
greatly improving efficiency.

Best,
Hui

> 
> Thanks!
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-20  9:29   ` hui.zhu
@ 2025-11-20 19:20     ` Michal Hocko
  2025-11-21  2:46       ` hui.zhu
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-11-20 19:20 UTC (permalink / raw)
  To: hui.zhu
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Peter Zijlstra, Miguel Ojeda, Nathan Chancellor,
	Kees Cook, Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu

On Thu 20-11-25 09:29:52, hui.zhu@linux.dev wrote:
[...]
> > I generally agree with an idea to use BPF for various memcg-related
> > policies, but I'm not sure how specific callbacks can be used in
> > practice.
> 
> Hi Roman,
> 
> Following are some ideas that can use ebpf memcg:
> 
> Priority‑Based Reclaim and Limits in Multi‑Tenant Environments:
> On a single machine with multiple tenants / namespaces / containers,
> under memory pressure it’s hard to decide “who should be squeezed first”
> with static policies baked into the kernel.
> Assign a BPF profile to each tenant’s memcg:
> Under high global pressure, BPF can decide:
> Which memcgs’ memory.high should be raised (delaying reclaim),
> Which memcgs should be scanned and reclaimed more aggressively.
> 
> Online Profiling / Diagnosing Memory Hotspots:
> A cgroup’s memory keeps growing, but without patching the kernel it’s
> difficult to obtain fine‑grained information.
> Attach BPF to the memcg charge/uncharge path:
> Record large allocations (greater than N KB) with call stacks and
> owning file/module, and send them to user space via a BPF ring buffer.
> Based on sampled data, generate:
> “Top N memory allocation stacks in this container over the last 10 minutes,”
> Reports of which objects / call paths are growing fastest.
> This makes it possible to pinpoint the root cause of host memory
> anomalies without changing application code, which is very useful
> in operations/ops scenarios.
> 
> SLO‑Driven Auto Throttling / Scale‑In/Out Signals:
> Use eBPF to observe memory usage slope, frequent reclaim,
> or near‑OOM behavior within a memcg.
> When it decides “OOM is imminent,” instead of just killing/raising
> limits, it can emit a signal to a control‑plane component.
> For example, send an event to a user‑space agent to trigger
> automatic scaling, QPS adjustment, or throttling.
> 
> Prevent a cgroup from launching a large‑scale fork+malloc attack:
> BPF checks per‑uid or per‑cgroup allocation behavior over the
> last few seconds during memcg charge.

AFAIU, these are just very high level ideas rather than anything you are
trying to target with this patch series, right?

All I can see is that you add a reclaim hook but it is not really clear
to me how feasible it is to actually implement a real memory reclaim
strategy this way.

In prinicipal I am not really opposed but the memory reclaim process is
rather involved process and I would really like to see there is
something real to be done without exporting all the MM code to BPF for
any practical use. Is there any POC out there?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-20 19:20     ` Michal Hocko
@ 2025-11-21  2:46       ` hui.zhu
  2025-11-25 12:12         ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: hui.zhu @ 2025-11-21  2:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel  Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah  Khan, Peter Zijlstra, Miguel Ojeda,
	Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu, mkoutny,
	Jan  Hendrik Farr, Christian Brauner, Randy Dunlap, Brian  Gerst,
	Masahiro Yamada, linux-kernel, linux-mm, cgroups, bpf,
	linux-kselftest, Hui Zhu

2025年11月21日 03:20, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > 写到:


> 
> On Thu 20-11-25 09:29:52, hui.zhu@linux.dev wrote:
> [...]
> 
> > 
> > I generally agree with an idea to use BPF for various memcg-related
> >  policies, but I'm not sure how specific callbacks can be used in
> >  practice.
> >  
> >  Hi Roman,
> >  
> >  Following are some ideas that can use ebpf memcg:
> >  
> >  Priority‑Based Reclaim and Limits in Multi‑Tenant Environments:
> >  On a single machine with multiple tenants / namespaces / containers,
> >  under memory pressure it’s hard to decide “who should be squeezed first”
> >  with static policies baked into the kernel.
> >  Assign a BPF profile to each tenant’s memcg:
> >  Under high global pressure, BPF can decide:
> >  Which memcgs’ memory.high should be raised (delaying reclaim),
> >  Which memcgs should be scanned and reclaimed more aggressively.
> >  
> >  Online Profiling / Diagnosing Memory Hotspots:
> >  A cgroup’s memory keeps growing, but without patching the kernel it’s
> >  difficult to obtain fine‑grained information.
> >  Attach BPF to the memcg charge/uncharge path:
> >  Record large allocations (greater than N KB) with call stacks and
> >  owning file/module, and send them to user space via a BPF ring buffer.
> >  Based on sampled data, generate:
> >  “Top N memory allocation stacks in this container over the last 10 minutes,”
> >  Reports of which objects / call paths are growing fastest.
> >  This makes it possible to pinpoint the root cause of host memory
> >  anomalies without changing application code, which is very useful
> >  in operations/ops scenarios.
> >  
> >  SLO‑Driven Auto Throttling / Scale‑In/Out Signals:
> >  Use eBPF to observe memory usage slope, frequent reclaim,
> >  or near‑OOM behavior within a memcg.
> >  When it decides “OOM is imminent,” instead of just killing/raising
> >  limits, it can emit a signal to a control‑plane component.
> >  For example, send an event to a user‑space agent to trigger
> >  automatic scaling, QPS adjustment, or throttling.
> >  
> >  Prevent a cgroup from launching a large‑scale fork+malloc attack:
> >  BPF checks per‑uid or per‑cgroup allocation behavior over the
> >  last few seconds during memcg charge.
> > 
> AFAIU, these are just very high level ideas rather than anything you are
> trying to target with this patch series, right?
> 
> All I can see is that you add a reclaim hook but it is not really clear
> to me how feasible it is to actually implement a real memory reclaim
> strategy this way.
> 
> In prinicipal I am not really opposed but the memory reclaim process is
> rather involved process and I would really like to see there is
> something real to be done without exporting all the MM code to BPF for
> any practical use. Is there any POC out there?

Hi Michal,

I apologize for not delivering a more substantial POC.

I was hesitant to add extensive eBPF support to memcg
because I wasn't certain it aligned with the community's
vision—and such support would require introducing many
eBPF hooks into memcg.

I will add more eBPF hook to memcg and provide a more
meaningful POC in the next version.

Best,
Hui


> -- 
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-21  2:46       ` hui.zhu
@ 2025-11-25 12:12         ` Michal Hocko
  2025-11-25 12:39           ` hui.zhu
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-11-25 12:12 UTC (permalink / raw)
  To: hui.zhu
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Peter Zijlstra, Miguel Ojeda, Nathan Chancellor,
	Kees Cook, Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu

On Fri 21-11-25 02:46:31, hui.zhu@linux.dev wrote:
> 2025年11月21日 03:20, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > 写到:
> 
> 
> > 
> > On Thu 20-11-25 09:29:52, hui.zhu@linux.dev wrote:
> > [...]
> > 
> > > 
> > > I generally agree with an idea to use BPF for various memcg-related
> > >  policies, but I'm not sure how specific callbacks can be used in
> > >  practice.
> > >  
> > >  Hi Roman,
> > >  
> > >  Following are some ideas that can use ebpf memcg:
> > >  
> > >  Priority‑Based Reclaim and Limits in Multi‑Tenant Environments:
> > >  On a single machine with multiple tenants / namespaces / containers,
> > >  under memory pressure it’s hard to decide “who should be squeezed first”
> > >  with static policies baked into the kernel.
> > >  Assign a BPF profile to each tenant’s memcg:
> > >  Under high global pressure, BPF can decide:
> > >  Which memcgs’ memory.high should be raised (delaying reclaim),
> > >  Which memcgs should be scanned and reclaimed more aggressively.
> > >  
> > >  Online Profiling / Diagnosing Memory Hotspots:
> > >  A cgroup’s memory keeps growing, but without patching the kernel it’s
> > >  difficult to obtain fine‑grained information.
> > >  Attach BPF to the memcg charge/uncharge path:
> > >  Record large allocations (greater than N KB) with call stacks and
> > >  owning file/module, and send them to user space via a BPF ring buffer.
> > >  Based on sampled data, generate:
> > >  “Top N memory allocation stacks in this container over the last 10 minutes,”
> > >  Reports of which objects / call paths are growing fastest.
> > >  This makes it possible to pinpoint the root cause of host memory
> > >  anomalies without changing application code, which is very useful
> > >  in operations/ops scenarios.
> > >  
> > >  SLO‑Driven Auto Throttling / Scale‑In/Out Signals:
> > >  Use eBPF to observe memory usage slope, frequent reclaim,
> > >  or near‑OOM behavior within a memcg.
> > >  When it decides “OOM is imminent,” instead of just killing/raising
> > >  limits, it can emit a signal to a control‑plane component.
> > >  For example, send an event to a user‑space agent to trigger
> > >  automatic scaling, QPS adjustment, or throttling.
> > >  
> > >  Prevent a cgroup from launching a large‑scale fork+malloc attack:
> > >  BPF checks per‑uid or per‑cgroup allocation behavior over the
> > >  last few seconds during memcg charge.
> > > 
> > AFAIU, these are just very high level ideas rather than anything you are
> > trying to target with this patch series, right?
> > 
> > All I can see is that you add a reclaim hook but it is not really clear
> > to me how feasible it is to actually implement a real memory reclaim
> > strategy this way.
> > 
> > In prinicipal I am not really opposed but the memory reclaim process is
> > rather involved process and I would really like to see there is
> > something real to be done without exporting all the MM code to BPF for
> > any practical use. Is there any POC out there?
> 
> Hi Michal,
> 
> I apologize for not delivering a more substantial POC.
> 
> I was hesitant to add extensive eBPF support to memcg
> because I wasn't certain it aligned with the community's
> vision—and such support would require introducing many
> eBPF hooks into memcg.
> 
> I will add more eBPF hook to memcg and provide a more
> meaningful POC in the next version.

Just to make sure we are on the same page. I am not suggesting we need
more of those hooks. I just want to see how many do we really need in
order to have a sensible eBPF driven reclaim policy which seems to be
the main usecase you want to puruse, right?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-25 12:12         ` Michal Hocko
@ 2025-11-25 12:39           ` hui.zhu
  2025-11-25 12:55             ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: hui.zhu @ 2025-11-25 12:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Peter Zijlstra, Miguel Ojeda, Nathan Chancellor,
	Kees Cook, Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu

2025年11月25日 20:12, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > 写到:


> 
> On Fri 21-11-25 02:46:31, hui.zhu@linux.dev wrote:
> 
> > 
> > 2025年11月21日 03:20, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > 写到:
> >  
> >  
> >  
> >  On Thu 20-11-25 09:29:52, hui.zhu@linux.dev wrote:
> >  [...]
> >  
> >  > 
> >  > I generally agree with an idea to use BPF for various memcg-related
> >  > policies, but I'm not sure how specific callbacks can be used in
> >  > practice.
> >  > 
> >  > Hi Roman,
> >  > 
> >  > Following are some ideas that can use ebpf memcg:
> >  > 
> >  > Priority‑Based Reclaim and Limits in Multi‑Tenant Environments:
> >  > On a single machine with multiple tenants / namespaces / containers,
> >  > under memory pressure it’s hard to decide “who should be squeezed first”
> >  > with static policies baked into the kernel.
> >  > Assign a BPF profile to each tenant’s memcg:
> >  > Under high global pressure, BPF can decide:
> >  > Which memcgs’ memory.high should be raised (delaying reclaim),
> >  > Which memcgs should be scanned and reclaimed more aggressively.
> >  > 
> >  > Online Profiling / Diagnosing Memory Hotspots:
> >  > A cgroup’s memory keeps growing, but without patching the kernel it’s
> >  > difficult to obtain fine‑grained information.
> >  > Attach BPF to the memcg charge/uncharge path:
> >  > Record large allocations (greater than N KB) with call stacks and
> >  > owning file/module, and send them to user space via a BPF ring buffer.
> >  > Based on sampled data, generate:
> >  > “Top N memory allocation stacks in this container over the last 10 minutes,”
> >  > Reports of which objects / call paths are growing fastest.
> >  > This makes it possible to pinpoint the root cause of host memory
> >  > anomalies without changing application code, which is very useful
> >  > in operations/ops scenarios.
> >  > 
> >  > SLO‑Driven Auto Throttling / Scale‑In/Out Signals:
> >  > Use eBPF to observe memory usage slope, frequent reclaim,
> >  > or near‑OOM behavior within a memcg.
> >  > When it decides “OOM is imminent,” instead of just killing/raising
> >  > limits, it can emit a signal to a control‑plane component.
> >  > For example, send an event to a user‑space agent to trigger
> >  > automatic scaling, QPS adjustment, or throttling.
> >  > 
> >  > Prevent a cgroup from launching a large‑scale fork+malloc attack:
> >  > BPF checks per‑uid or per‑cgroup allocation behavior over the
> >  > last few seconds during memcg charge.
> >  > 
> >  AFAIU, these are just very high level ideas rather than anything you are
> >  trying to target with this patch series, right?
> >  
> >  All I can see is that you add a reclaim hook but it is not really clear
> >  to me how feasible it is to actually implement a real memory reclaim
> >  strategy this way.
> >  
> >  In prinicipal I am not really opposed but the memory reclaim process is
> >  rather involved process and I would really like to see there is
> >  something real to be done without exporting all the MM code to BPF for
> >  any practical use. Is there any POC out there?
> >  
> >  Hi Michal,
> >  
> >  I apologize for not delivering a more substantial POC.
> >  
> >  I was hesitant to add extensive eBPF support to memcg
> >  because I wasn't certain it aligned with the community's
> >  vision—and such support would require introducing many
> >  eBPF hooks into memcg.
> >  
> >  I will add more eBPF hook to memcg and provide a more
> >  meaningful POC in the next version.
> > 
> Just to make sure we are on the same page. I am not suggesting we need
> more of those hooks. I just want to see how many do we really need in
> order to have a sensible eBPF driven reclaim policy which seems to be
> the main usecase you want to puruse, right?

I got your point.

My goal is implement dynamic memory reclamation for memcgs without limits,
triggered by specific conditions.

For instance, with memcg A and memcg B both unlimited, when memcg A faces
high PSI pressure, ebpf control memcg B do some memory reclaim work when
it try charge.

Best,
Hui

> -- 
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-25 12:39           ` hui.zhu
@ 2025-11-25 12:55             ` Michal Hocko
  2025-11-26  3:05               ` hui.zhu
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-11-25 12:55 UTC (permalink / raw)
  To: hui.zhu
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Peter Zijlstra, Miguel Ojeda, Nathan Chancellor,
	Kees Cook, Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu

On Tue 25-11-25 12:39:11, hui.zhu@linux.dev wrote:
> My goal is implement dynamic memory reclamation for memcgs without limits,
> triggered by specific conditions.
> 
> For instance, with memcg A and memcg B both unlimited, when memcg A faces
> high PSI pressure, ebpf control memcg B do some memory reclaim work when
> it try charge.

Understood. Please also think whether this is already possible with
existing interfaces and if not what are roadblocks in that direction.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-25 12:55             ` Michal Hocko
@ 2025-11-26  3:05               ` hui.zhu
  2025-11-26 16:01                 ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: hui.zhu @ 2025-11-26  3:05 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Peter Zijlstra, Miguel Ojeda, Nathan Chancellor,
	Kees Cook, Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu

2025年11月25日 20:55, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > 写到:


> 
> On Tue 25-11-25 12:39:11, hui.zhu@linux.dev wrote:
> 
> > 
> > My goal is implement dynamic memory reclamation for memcgs without limits,
> >  triggered by specific conditions.
> >  
> >  For instance, with memcg A and memcg B both unlimited, when memcg A faces
> >  high PSI pressure, ebpf control memcg B do some memory reclaim work when
> >  it try charge.
> > 
> Understood. Please also think whether this is already possible with
> existing interfaces and if not what are roadblocks in that direction.

I think it's possible to implement a userspace program using the existing
PSI userspace interfaces and the control interfaces provided by memcg to
accomplish this task.
However, this approach has several limitations:
the entire process depends on the continuous execution of the userspace
program, response latency is higher, and we cannot perform fine-grained
operations on target memcg.

Now that Roman has provided PSI eBPF functionality at
https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
Maybe we could add eBPF support to memcg as well, allowing us to implement
the entire functionality directly in the kernel through eBPF.

Best,
Hui

> 
> Thanks!
> -- 
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-26  3:05               ` hui.zhu
@ 2025-11-26 16:01                 ` Michal Hocko
  2025-11-27  8:51                   ` hui.zhu
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-11-26 16:01 UTC (permalink / raw)
  To: hui.zhu
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Peter Zijlstra, Miguel Ojeda, Nathan Chancellor,
	Kees Cook, Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu

On Wed 26-11-25 03:05:32, hui.zhu@linux.dev wrote:
> 2025年11月25日 20:55, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > 写到:
> 
> 
> > 
> > On Tue 25-11-25 12:39:11, hui.zhu@linux.dev wrote:
> > 
> > > 
> > > My goal is implement dynamic memory reclamation for memcgs without limits,
> > >  triggered by specific conditions.
> > >  
> > >  For instance, with memcg A and memcg B both unlimited, when memcg A faces
> > >  high PSI pressure, ebpf control memcg B do some memory reclaim work when
> > >  it try charge.
> > > 
> > Understood. Please also think whether this is already possible with
> > existing interfaces and if not what are roadblocks in that direction.
> 
> I think it's possible to implement a userspace program using the existing
> PSI userspace interfaces and the control interfaces provided by memcg to
> accomplish this task.
> However, this approach has several limitations:
> the entire process depends on the continuous execution of the userspace
> program, response latency is higher, and we cannot perform fine-grained
> operations on target memcg.

I will need to back these arguments by some actual numbers.

> Now that Roman has provided PSI eBPF functionality at
> https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
> Maybe we could add eBPF support to memcg as well, allowing us to implement
> the entire functionality directly in the kernel through eBPF.

His usecase is very specific to OOM handling and we have agreed that
this specific usecase is really tricky to achieve from userspace. I
haven't see sound arguments for this usecase yet.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH 0/3] Memory Controller eBPF support
  2025-11-26 16:01                 ` Michal Hocko
@ 2025-11-27  8:51                   ` hui.zhu
  0 siblings, 0 replies; 19+ messages in thread
From: hui.zhu @ 2025-11-27  8:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Roman Gushchin, Andrew Morton, Johannes Weiner, Shakeel Butt,
	Muchun Song, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Peter Zijlstra, Miguel Ojeda, Nathan Chancellor,
	Kees Cook, Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr,
	Christian Brauner, Randy Dunlap, Brian Gerst, Masahiro Yamada,
	linux-kernel, linux-mm, cgroups, bpf, linux-kselftest, Hui Zhu

2025年11月27日 00:01, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > 写到:


> 
> On Wed 26-11-25 03:05:32, hui.zhu@linux.dev wrote:
> 
> > 
> > 2025年11月25日 20:55, "Michal Hocko" <mhocko@suse.com mailto:mhocko@suse.com?to=%22Michal%20Hocko%22%20%3Cmhocko%40suse.com%3E > 写到:
> >  
> >  
> >  
> >  On Tue 25-11-25 12:39:11, hui.zhu@linux.dev wrote:
> >  
> >  > 
> >  > My goal is implement dynamic memory reclamation for memcgs without limits,
> >  > triggered by specific conditions.
> >  > 
> >  > For instance, with memcg A and memcg B both unlimited, when memcg A faces
> >  > high PSI pressure, ebpf control memcg B do some memory reclaim work when
> >  > it try charge.
> >  > 
> >  Understood. Please also think whether this is already possible with
> >  existing interfaces and if not what are roadblocks in that direction.
> >  
> >  I think it's possible to implement a userspace program using the existing
> >  PSI userspace interfaces and the control interfaces provided by memcg to
> >  accomplish this task.
> >  However, this approach has several limitations:
> >  the entire process depends on the continuous execution of the userspace
> >  program, response latency is higher, and we cannot perform fine-grained
> >  operations on target memcg.
> > 
> I will need to back these arguments by some actual numbers.

Agree – I’ll implement a PoC show it.

Best,
Hui

> 
> > 
> > Now that Roman has provided PSI eBPF functionality at
> >  https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
> >  Maybe we could add eBPF support to memcg as well, allowing us to implement
> >  the entire functionality directly in the kernel through eBPF.
> > 
> His usecase is very specific to OOM handling and we have agreed that
> this specific usecase is really tricky to achieve from userspace. I
> haven't see sound arguments for this usecase yet.
> -- 
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-11-27  8:51 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-11-19  1:34 [RFC PATCH 0/3] Memory Controller eBPF support Hui Zhu
2025-11-19  1:34 ` [RFC PATCH 1/3] memcg: add eBPF struct ops support for memory charging Hui Zhu
2025-11-19  2:10   ` bot+bpf-ci
2025-11-19 16:07   ` Tejun Heo
2025-11-21 19:24   ` kernel test robot
2025-11-19  1:34 ` [RFC PATCH 2/3] selftests/bpf: add memcg eBPF struct ops test Hui Zhu
2025-11-19  2:19   ` bot+bpf-ci
2025-11-19  1:34 ` [RFC PATCH 3/3] samples/bpf: add example memcg eBPF program Hui Zhu
2025-11-19  2:19   ` bot+bpf-ci
2025-11-20  3:04 ` [RFC PATCH 0/3] Memory Controller eBPF support Roman Gushchin
2025-11-20  9:29   ` hui.zhu
2025-11-20 19:20     ` Michal Hocko
2025-11-21  2:46       ` hui.zhu
2025-11-25 12:12         ` Michal Hocko
2025-11-25 12:39           ` hui.zhu
2025-11-25 12:55             ` Michal Hocko
2025-11-26  3:05               ` hui.zhu
2025-11-26 16:01                 ` Michal Hocko
2025-11-27  8:51                   ` hui.zhu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.