* [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim
From: Hui Zhu @ 2026-05-26 2:20 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, JP Kobryn, Andrew Morton, Shuah Khan, davem,
Jakub Kicinski, Jesper Dangaard Brouer, Stanislav Fomichev,
KP Singh, Tao Chen, Mykyta Yatsenko, Leon Hwang, Anton Protopopov,
Amery Hung, Tobias Klauser, Eyal Birger, Rong Tao, Hao Luo,
Peter Zijlstra, Miguel Ojeda, Nathan Chancellor, Kees Cook,
Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr, Christian Brauner,
Randy Dunlap, Brian Gerst, Masahiro Yamada, Willem de Bruijn,
Jason Xing, Paul Chaignon, Chen Ridong, Lance Yang, Jiayuan Chen,
linux-kernel, bpf, cgroups, linux-mm, netdev, linux-kselftest
Cc: geliang, baohua, Hui Zhu
From: Hui Zhu <zhuhui@kylinos.cn>
Overview:
This series introduces BPF struct_ops support for the memory controller,
enabling userspace BPF programs to implement custom, dynamic memory
management policies per cgroup. The feature allows BPF programs to hook
into the core reclaim and charge paths without requiring kernel
modifications, providing a flexible alternative to static knobs such as
memory.low and memory.min.
The series enables two complementary use cases.
Dynamic memory protection: static memory protection thresholds
(memory.low, memory.min) are poor fits for workloads whose actual memory
activity varies over time. A high-priority cgroup holding a large working
set but temporarily idle will still suppress reclaim on its siblings,
wasting available memory. A BPF-driven approach can observe real workload
activity -- page faults, charge/uncharge events -- and activate or
withdraw protection dynamically. The test results at the end of this
letter quantify the difference: in a scenario where the high-priority
cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
37x higher throughput than with static memory.low.
Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
hooks, combined with the BPF workqueue mechanism and the new
bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
proactive background reclaim without blocking the charge path. The
pattern works as follows: the memcg_charged callback tracks accumulated
memory usage; when usage crosses a configurable threshold, it enqueues an
asynchronous work item via bpf_wq_start() and returns immediately without
throttling the charging task. The workqueue callback then invokes
bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
cgroup; if usage remains elevated after reclaim, the callback re-enqueues
itself to continue. This allows a BPF program to keep a cgroup's
footprint below its hard limit (memory.max) entirely in the background,
avoiding the OOM killer or direct-reclaim stalls that would otherwise
occur. The selftest for this feature (patch 10/11) validates the
mechanism concretely: a workload that writes and mmaps a 64 MB file inside
a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
with the async reclaim program attached, the "max" counter does not
increase at all across the same workload.
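The control flow described above can be sketched as a small userspace model. This is only an illustration under stated assumptions: struct reclaim_model and the model_*() functions are hypothetical names, not code from this series. The pending flag stands in for a work item queued with bpf_wq_start(), and the subtraction in model_work() stands in for a call to bpf_try_to_free_mem_cgroup_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names come from the series. */
struct reclaim_model {
	uint64_t usage;      /* pages currently charged to the cgroup */
	uint64_t threshold;  /* background reclaim starts above this value */
	bool work_pending;   /* stands in for a queued bpf_wq work item */
};

/* Models memcg_charged: account the charge, queue work if needed, never block. */
static unsigned int model_charged(struct reclaim_model *m, uint64_t pages)
{
	m->usage += pages;
	if (m->usage > m->threshold)
		m->work_pending = true;	/* bpf_wq_start() in a real program */
	return 0;			/* zero delay: the charging task is not throttled */
}

/* Models memcg_uncharged: account the release, clamping at zero. */
static void model_uncharged(struct reclaim_model *m, uint64_t pages)
{
	m->usage -= pages < m->usage ? pages : m->usage;
}

/* Models the workqueue callback: reclaim one batch, re-queue if still over. */
static void model_work(struct reclaim_model *m, uint64_t batch)
{
	uint64_t freed = batch < m->usage ? batch : m->usage;

	m->work_pending = false;
	m->usage -= freed;	/* bpf_try_to_free_mem_cgroup_pages() in a real program */
	if (m->usage > m->threshold)
		m->work_pending = true;
}

/* Runs queued work until nothing is pending; returns the number of passes. */
static unsigned int model_drain(struct reclaim_model *m, uint64_t batch)
{
	unsigned int passes = 0;

	assert(batch > 0);	/* a zero batch would never make progress */
	while (m->work_pending) {
		model_work(m, batch);
		passes++;
	}
	return passes;
}
```

The point of the split is that model_charged() only flips a flag and returns, so the cost of reclaim is paid in the worker rather than by the task that triggered the charge.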
This series incorporates part of Roman Gushchin's series [1] so that it
compiles cleanly on bpf-next.
Patch Breakdown:
Patches 1-4 are from Roman Gushchin's series [1], included here to
provide the necessary BPF infrastructure for attaching struct_ops to
cgroups.
Patches 5-11 are the new work in this series:
05/11 bpf: Pass flags in bpf_link_create for struct_ops
Stores attr->link_create.flags in struct bpf_struct_ops_link
and extends the validation to allow BPF_F_ALLOW_OVERRIDE.
Also updates the UAPI comment to reflect that cgroup-bpf attach
flags now apply to BPF_LINK_CREATE in addition to
BPF_PROG_ATTACH.
06/11 mm: memcontrol: Add BPF struct_ops for memory controller
The core feature patch. Introduces the memcg_bpf_ops struct_ops
type with the following hooks:
- memcg_charged(memcg, batch): called on the synchronous charge
path. Returns a throttling delay in milliseconds; used as a
lower bound for __mem_cgroup_handle_over_high(), effective
even when memory.high is not breached.
- memcg_uncharged(memcg, batch): called on uncharge, allowing
BPF programs to track memory releases.
- below_low(memcg, elow, usage): overrides the memory.low
protection check. Returns true to treat the cgroup as
protected regardless of the elow >= usage comparison.
- below_min(memcg, emin, usage): same as below_low but for
memory.min protection.
- handle_cgroup_online/offline(memcg): lifecycle callbacks for
per-cgroup state management in BPF programs.
BPF_F_ALLOW_OVERRIDE is supported: when a program is registered
with this flag, descendant cgroups may attach their own
memcg_bpf_ops to override the inherited policy. Registration
propagates ops down through the subtree via mem_cgroup_iter;
unregistration restores each descendant to the ops its
registering ancestor's parent held, correctly preserving
override chains.
07/11 mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
Exposes try_to_free_mem_cgroup_pages() to BPF programs as a
KF_SLEEPABLE kfunc. A swappiness parameter controls the
override value passed to the core reclaim path
(effective only when MEMCG_RECLAIM_PROACTIVE is set in
reclaim_options).
08/11 selftests/bpf: Add tests for memcg_bpf_ops
Adds prog_tests/memcg_ops.c covering three scenarios:
memcg_charged-only throttling, below_low + memcg_charged
interaction, and below_min + memcg_charged interaction. A
tracepoint on memcg:count_memcg_events (PGFAULT) is used to
detect memory pressure and trigger hooks accordingly.
09/11 selftests/bpf: Add test for memcg_bpf_ops hierarchies
Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
root, override at the middle level without the flag, then assert
that attaching to the leaf correctly fails with -EBUSY.
10/11 selftests/bpf: Add selftest for memcg async reclaim via BPF
Demonstrates and validates asynchronous memory reclaim: a BPF
program uses the memcg_charged/memcg_uncharged hooks to track
accumulated usage and, when a threshold is exceeded, enqueues a
bpf_wq_start() workqueue item that calls
bpf_try_to_free_mem_cgroup_pages() without blocking the charge
path. The test asserts that with the BPF program active,
memory.events "max" events do not increase under a workload
that would otherwise exceed the hard limit.
11/11 samples/bpf: Add memcg priority control and async reclaim example
Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
demonstrating both features. The BPF side monitors PGFAULT
events on a high-priority cgroup; when the per-second fault
count crosses a configurable threshold, it activates below_low
or below_min protection for the high-priority cgroup and/or
applies a charge delay to the low-priority cgroup. Six
struct_ops variants are exported so userspace can attach only
the hooks needed. Async reclaim is optionally combined with
priority throttling via a shared low-cgroup ops map.
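As a reading aid for patches 06/11 and 09/11, here is a minimal userspace model of how the hook results and the attach rules are described above. It is a sketch under assumptions, not the kernel code: every model_*() name and struct model_cgroup are hypothetical. Part 1 assumes that a false return from below_low falls back to the static elow >= usage comparison, and that the memcg_charged delay is combined with the memory.high delay by taking the larger of the two. Part 2 assumes that the nearest attached ancestor decides whether a descendant may attach, and that attaching twice to the same cgroup fails.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Part 1: how hook return values are assumed to combine. */

typedef bool (*below_fn)(uint64_t effective, uint64_t usage);

/* Example hook that always requests protection. */
static bool hook_always(uint64_t effective, uint64_t usage)
{
	(void)effective;
	(void)usage;
	return true;
}

/* Example hook that never requests protection. */
static bool hook_never(uint64_t effective, uint64_t usage)
{
	(void)effective;
	(void)usage;
	return false;
}

/* A true hook verdict protects the cgroup; otherwise use the static check. */
static bool model_below_low(uint64_t elow, uint64_t usage, below_fn hook)
{
	if (hook && hook(elow, usage))
		return true;
	return elow >= usage;
}

/* The hook delay acts as a lower bound on the memory.high delay. */
static unsigned int model_throttle_ms(unsigned int over_high_ms,
				      unsigned int hook_ms)
{
	return hook_ms > over_high_ms ? hook_ms : over_high_ms;
}

/* Part 2: attach rules across a hierarchy. */

struct model_cgroup {
	struct model_cgroup *parent;
	bool attached;		/* this cgroup has its own ops attached */
	bool allow_override;	/* attached with the ALLOW_OVERRIDE flag */
};

static int model_attach(struct model_cgroup *cg, bool allow_override)
{
	struct model_cgroup *p;

	if (cg->attached)
		return -EBUSY;	/* assumption: no double attach */

	for (p = cg->parent; p; p = p->parent) {
		if (!p->attached)
			continue;
		if (!p->allow_override)
			return -EBUSY;
		break;		/* the nearest attached ancestor decides */
	}

	cg->attached = true;
	cg->allow_override = allow_override;
	return 0;
}
```

With a root, middle and leaf cgroup, attaching to the root with override allowed and then to the middle without it leaves the leaf unable to attach, which matches the scenario the hierarchy selftest checks.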
Test Environment:
The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using
a tmpfs-backed file on the host as a swap device to reduce I/O impact.
Two cgroups are created -- high (high-priority) and low (low-priority)
-- and each test runs two concurrent stress-ng workloads, one per
cgroup, each requesting 3 GB of memory.
# mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
# free -h
               total        used        free      shared  buff/cache   available
Mem:           1.9Gi       317Mi       1.6Gi       1.0Mi       144Mi       1.6Gi
Swap:          4.0Gi          0B       4.0Gi
Baseline: no memory priority policy:
Both cgroups run without any reclaim protection. Results are roughly
equal, as expected:
cgroup    bogo ops/s
high           4,979
low            4,927
Test 1: memory.low protection:
Setting memory.low on the high-priority cgroup protects it from
reclaim, at the cost of pushing reclaim pressure onto the low-priority
cgroup:
# echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
cgroup    bogo ops/s
high         450,290
low           11,307
The high-priority cgroup benefits significantly, but memory.low relies
on static usage thresholds and cannot adapt to actual workload
behavior.
Test 2: memory.low with an idle high-priority task:
Here the high-priority cgroup runs a Python script that allocates 3 GB
and then sleeps, simulating a low-activity but memory-holding workload.
Because the process is idle, it generates no page faults and does not
actively use its memory. Yet memory.low still protects it, continuing
to suppress the low-priority cgroup's performance:
cgroup    bogo ops/s
low           14,757
The low-priority cgroup remains significantly throttled despite the
high-priority cgroup being effectively idle -- a clear limitation of
static memory.low control.
Test 3: memcg eBPF -- dynamic priority control:
memcg is a sample program introduced in this patch series
(samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
monitors PGFAULT events in the high-priority cgroup. When the
per-second fault count exceeds a configured threshold, the hook
activates below_min protection for one second; otherwise the cgroup
receives no special treatment.
# ./memcg --low_path=/sys/fs/cgroup/low \
--high_path=/sys/fs/cgroup/high \
--threshold=1 --use_below_min
Successfully attached!
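The activation rule described above (count faults per one-second window, protect for one second once the count exceeds the threshold) can be modelled in userspace as follows. This is a hedged sketch, not the sample's code: struct fault_window, model_on_fault() and model_below_min() are hypothetical names, timestamps are assumed to be monotonic nanoseconds, and the exact windowing in samples/bpf/memcg.bpf.c may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-cgroup state for the fault-rate check. */
struct fault_window {
	uint64_t window_start_ns;	/* start of the current one-second window */
	uint64_t count;			/* faults seen in the current window */
	uint64_t protect_until_ns;	/* protection is active before this time */
	uint64_t threshold;		/* faults per second needed to activate */
};

/* Called once per PGFAULT event observed in the high-priority cgroup. */
static void model_on_fault(struct fault_window *w, uint64_t now_ns)
{
	assert(now_ns >= w->window_start_ns);	/* timestamps must be monotonic */

	/* Start a new window once a full second has elapsed. */
	if (now_ns - w->window_start_ns >= NSEC_PER_SEC) {
		w->window_start_ns = now_ns;
		w->count = 0;
	}

	w->count++;
	if (w->count > w->threshold)
		w->protect_until_ns = now_ns + NSEC_PER_SEC;
}

/* The verdict a below_min hook would return at time now_ns. */
static bool model_below_min(const struct fault_window *w, uint64_t now_ns)
{
	return now_ns < w->protect_until_ns;
}
```

An idle cgroup never calls model_on_fault(), so protect_until_ns stays at zero and the verdict is always false, which is the behaviour Test 3b relies on.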
3a. Both cgroups under active memory pressure:
When both cgroups run stress-ng, the high-priority cgroup generates
frequent page faults and the BPF hook activates protection, matching
the behavior of memory.low:
cgroup    bogo ops/s
high         404,392
low           11,404
3b. High-priority cgroup is idle (Python + sleep):
Because the sleeping Python process generates no page faults, the BPF
hook never activates, and the low-priority cgroup is free to reclaim
memory normally:
cgroup    bogo ops/s
low          551,083
This is a ~37x improvement over the equivalent memory.low scenario
(Test 2), demonstrating that eBPF-driven dynamic control can
accurately reflect actual workload activity and avoid unnecessary
protection of idle high-priority tasks.
Summary:
Scenario                   low-cgroup bogo ops/s
Baseline (no policy)                      ~4,927
memory.low, both active                  ~11,307
memory.low, high idle                    ~14,757
memcg eBPF, both active                  ~11,404
memcg eBPF, high idle                   ~551,083
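For readers who want to check the ratios, the arithmetic behind the ~37x figure is just a division of the two "high idle" rows. The helper below is a hypothetical convenience, not part of the series.

```c
#include <assert.h>

/* Ratio of two throughput figures, both in bogo ops/s. */
static double speedup(double with_policy, double reference)
{
	assert(reference > 0.0);	/* avoid dividing by zero */
	return with_policy / reference;
}
```

speedup(551083.0, 14757.0) compares memcg eBPF with static memory.low in the idle case and comes out a little above 37.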
References:
[1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/
Changelog:
v7:
Rebase onto v3 of the "mm: BPF OOM" series.
Fix issues reported by bpf-ci.
Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged
hook for tracking uncharge events.
Update below_low and below_min hooks to receive elow/emin and usage
as explicit arguments.
Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim
to BPF programs.
Add selftest for BPF-driven asynchronous page reclaim.
Extend samples/bpf/memcg to support async reclaim in addition to
priority throttling.
v6:
Based on the bot+bpf-ci comments, fixed the following issues.
Added fast-path check with unlikely() before SRCU lock acquisition to
optimize the no-BPF case in BPF_MEMCG_CALL.
Add missing newline to the pr_warn message in bpf_memcontrol_init.
Added comprehensive child process exit status checking with WIFEXITED()
and WEXITSTATUS(), and added zombie process prevention in
real_test_memcg_ops.
Changed malloc() to calloc() for BSS data allocation in all test
functions and samples main function.
Change srcu_read_lock(&memcg_bpf_srcu) to
lockdep_assert_held(&cgroup_mutex) in memcontrol_bpf_online and
memcontrol_bpf_offline.
v5:
Based on the bot+bpf-ci comments, fixed the following issues.
Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
declaration to the beginning of need_threshold() function.
The 'u64 current_ts' variable must be declared before any
executable statements.
Improved input validation in samples/bpf/memcg.c by adding a new
parse_u64() helper function. This function properly handles errors
from strtoull() and provides better error messages when parsing
threshold and over_high_ms command-line arguments.
Move check for prog->sleepable after validating member offsets in
mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
Fixed sscanf return value checking in prog_tests/memcg_ops.c.
Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
sscanf returns the number of successfully matched items, not a negative
value on error. This makes the test more reliable when reading timing
data from temporary files.
v4:
Fix the issues according to the comments from bot+bpf-ci.
According to JP Kobryn's comments, move exit(0) from
real_test_memcg_ops_child_work to real_test_memcg_ops.
Fix issues in the bpf_memcg_ops_reg function.
v3:
According to the comments from Michal Koutný and Chen Ridong, update hooks
to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
handle_cgroup_offline.
According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
support to memcg_bpf_ops.
v2:
According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
OOM patch series [1] and added hierarchical delegation support.
According to the comments from Roman Gushchin and Michal Hocko, designed
concrete use case scenarios and provided test results.
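The two parsing fixes mentioned in the v5 notes can be illustrated with a short, standalone sketch. It is not the code from the series: read_one_int() and parse_u64_model() are hypothetical names, and parse_u64_model() is only one plausible shape for a strtoull()-based helper. What it relies on is standard C behaviour: sscanf() returns 0 on a matching failure and EOF only when input ends before the first conversion, so a "< 0" check misses inputs such as "abc".

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Succeeds only when exactly one integer was converted. */
static bool read_one_int(const char *text, int *out)
{
	return sscanf(text, "%d", out) == 1;
}

/* Hypothetical strict parser: decimal digits only, no sign, no trailing text. */
static bool parse_u64_model(const char *text, uint64_t *out)
{
	unsigned long long value;
	char *end;

	/* Reject empty input, signs and leading whitespace up front. */
	if (!isdigit((unsigned char)*text))
		return false;

	errno = 0;
	value = strtoull(text, &end, 10);
	if (errno == ERANGE || *end != '\0')
		return false;	/* overflow or trailing characters */

	*out = value;
	return true;
}
```

Checking for "== 1" treats both the matching-failure and the end-of-input cases as errors, which is what a test reading timing data from a temporary file needs.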
Hui Zhu (7):
bpf: Pass flags in bpf_link_create for struct_ops
mm: memcontrol: Add BPF struct_ops for memory controller
mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
selftests/bpf: Add tests for memcg_bpf_ops
selftests/bpf: Add test for memcg_bpf_ops hierarchies
selftests/bpf: Add selftest for memcg async reclaim via BPF
samples/bpf: Add memcg priority control and async reclaim example
Roman Gushchin (4):
bpf: move bpf_struct_ops_link into bpf.h
bpf: allow attaching struct_ops to cgroups
libbpf: fix return value on memory allocation failure
libbpf: introduce bpf_map__attach_struct_ops_opts()
MAINTAINERS | 6 +
include/linux/bpf-cgroup-defs.h | 3 +
include/linux/bpf-cgroup.h | 16 +
include/linux/bpf.h | 10 +
include/linux/memcontrol.h | 250 ++++++-
include/uapi/linux/bpf.h | 5 +-
kernel/bpf/bpf_struct_ops.c | 67 +-
kernel/bpf/cgroup.c | 46 ++
mm/bpf_memcontrol.c | 355 +++++++++-
mm/memcontrol.c | 43 +-
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 8 +-
samples/bpf/memcg.bpf.c | 380 +++++++++++
samples/bpf/memcg.c | 411 ++++++++++++
tools/include/uapi/linux/bpf.h | 3 +-
tools/lib/bpf/libbpf.c | 22 +-
tools/lib/bpf/libbpf.h | 14 +
tools/lib/bpf/libbpf.map | 1 +
tools/testing/selftests/bpf/cgroup_helpers.c | 41 ++
tools/testing/selftests/bpf/cgroup_helpers.h | 2 +
.../bpf/prog_tests/memcg_async_reclaim.c | 333 +++++++++
.../selftests/bpf/prog_tests/memcg_ops.c | 634 ++++++++++++++++++
.../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
24 files changed, 2952 insertions(+), 34 deletions(-)
create mode 100644 samples/bpf/memcg.bpf.c
create mode 100644 samples/bpf/memcg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
--
2.43.0
* [RFC PATCH bpf-next v7 01/11] bpf: move bpf_struct_ops_link into bpf.h
From: Hui Zhu @ 2026-05-26 2:20 UTC (permalink / raw)
Cc: geliang, baohua, Matt Bobrowski, Yafang Shao
From: Roman Gushchin <roman.gushchin@linux.dev>
Move the definition of struct bpf_struct_ops_link into bpf.h,
where the other custom bpf link definitions live.
Its members need to be accessed from outside the generic
bpf_struct_ops implementation, which the following patches
in the series do.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Matt Bobrowski <mattbobrowski@google.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/bpf.h | 6 ++++++
kernel/bpf/bpf_struct_ops.c | 6 ------
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1b28cacc3075..01c0bf5a9cd0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1908,6 +1908,12 @@ struct bpf_raw_tp_link {
u64 cookie;
};
+struct bpf_struct_ops_link {
+ struct bpf_link link;
+ struct bpf_map __rcu *map;
+ wait_queue_head_t wait_hup;
+};
+
struct bpf_link_primer {
struct bpf_link *link;
struct file *file;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 521cb9d7e8c7..cf3c604d48ef 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -55,12 +55,6 @@ struct bpf_struct_ops_map {
struct bpf_struct_ops_value kvalue;
};
-struct bpf_struct_ops_link {
- struct bpf_link link;
- struct bpf_map __rcu *map;
- wait_queue_head_t wait_hup;
-};
-
static DEFINE_MUTEX(update_mutex);
#define VALUE_PREFIX "bpf_struct_ops_"
--
2.43.0
* [RFC PATCH bpf-next v7 02/11] bpf: allow attaching struct_ops to cgroups
From: Hui Zhu @ 2026-05-26 2:20 UTC (permalink / raw)
Cc: geliang, baohua
From: Roman Gushchin <roman.gushchin@linux.dev>
Introduce the ability to attach bpf struct_ops to cgroups.
From the user's standpoint it works as follows:
the user passes the BPF_F_CGROUP_FD flag and specifies the target cgroup
fd when creating a struct_ops link. As a result, the bpf struct_ops
link is created and attached to that cgroup.
The cgroup.bpf structure maintains a list of attached struct_ops links.
If the cgroup is deleted, the attached struct_ops are auto-detached
and the userspace program is notified.
This change does not address how bpf programs belonging to these
struct_ops will be executed. That will be done individually
for every bpf struct_ops type that supports it.
Note that, unlike "normal" bpf programs, struct_ops
are not propagated to cgroup sub-trees.
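As an aside for reviewers, the flag check this patch adds at the top of bpf_struct_ops_link_create() can be modelled in userspace as below. This is a sketch, not kernel code: model_check_link_flags() is a hypothetical name, and BPF_F_CGROUP_FD is defined locally with the value from the uapi hunk in this patch because released headers do not carry it.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Value copied from the uapi hunk in this patch; defined here for the model. */
#define BPF_F_CGROUP_FD (1U << 7)

/* Accepts no flags or BPF_F_CGROUP_FD alone; rejects every other bit. */
static int model_check_link_flags(uint32_t flags)
{
	if (flags & ~BPF_F_CGROUP_FD)
		return -EINVAL;
	return 0;
}
```

Any other bit, including flags that are valid for other link types, is rejected with -EINVAL at this stage of the series.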
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/bpf-cgroup-defs.h | 3 ++
include/linux/bpf-cgroup.h | 16 +++++++++
include/linux/bpf.h | 3 ++
include/uapi/linux/bpf.h | 3 ++
kernel/bpf/bpf_struct_ops.c | 59 ++++++++++++++++++++++++++++++---
kernel/bpf/cgroup.c | 46 +++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 1 +
7 files changed, 127 insertions(+), 4 deletions(-)
diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
index c9e6b26abab6..6c5e37190dad 100644
--- a/include/linux/bpf-cgroup-defs.h
+++ b/include/linux/bpf-cgroup-defs.h
@@ -71,6 +71,9 @@ struct cgroup_bpf {
/* temp storage for effective prog array used by prog_attach/detach */
struct bpf_prog_array *inactive;
+ /* list of bpf struct ops links */
+ struct list_head struct_ops_links;
+
/* reference counter used to detach bpf programs after cgroup removal */
struct percpu_ref refcnt;
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index b2e79c2b41d5..88b643568012 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -423,6 +423,11 @@ int cgroup_bpf_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
int cgroup_bpf_prog_query(const union bpf_attr *attr,
union bpf_attr __user *uattr);
+int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
+ struct bpf_struct_ops_link *link);
+void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
+ struct bpf_struct_ops_link *link);
+
const struct bpf_func_proto *
cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog);
#else
@@ -451,6 +456,17 @@ static inline int cgroup_bpf_link_attach(const union bpf_attr *attr,
return -EINVAL;
}
+static inline int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
+ struct bpf_struct_ops_link *link)
+{
+ return -EINVAL;
+}
+
+static inline void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
+ struct bpf_struct_ops_link *link)
+{
+}
+
static inline int cgroup_bpf_prog_query(const union bpf_attr *attr,
union bpf_attr __user *uattr)
{
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 01c0bf5a9cd0..743b4f0546b5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1911,6 +1911,9 @@ struct bpf_raw_tp_link {
struct bpf_struct_ops_link {
struct bpf_link link;
struct bpf_map __rcu *map;
+ struct cgroup *cgroup;
+ bool cgroup_removed;
+ struct list_head list;
wait_queue_head_t wait_hup;
};
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aec171ccb6ef..f547613986cc 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1246,6 +1246,7 @@ enum bpf_perf_event_type {
#define BPF_F_AFTER (1U << 4)
#define BPF_F_ID (1U << 5)
#define BPF_F_PREORDER (1U << 6)
+#define BPF_F_CGROUP_FD (1U << 7)
#define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
/* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
@@ -6793,6 +6794,8 @@ struct bpf_link_info {
} xdp;
struct {
__u32 map_id;
+ __u32 :32;
+ __u64 cgroup_id;
} struct_ops;
struct {
__u32 pf;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index cf3c604d48ef..5333290957cb 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -13,6 +13,8 @@
#include <linux/btf_ids.h>
#include <linux/rcupdate_wait.h>
#include <linux/poll.h>
+#include <linux/bpf-cgroup.h>
+#include <linux/cgroup.h>
struct bpf_struct_ops_value {
struct bpf_struct_ops_common_value common;
@@ -1220,6 +1222,10 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data, link);
bpf_map_put(&st_map->map);
}
+
+ if (st_link->cgroup)
+ cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
+
kfree(st_link);
}
@@ -1228,6 +1234,7 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
{
struct bpf_struct_ops_link *st_link;
struct bpf_map *map;
+ u64 cgrp_id = 0;
st_link = container_of(link, struct bpf_struct_ops_link, link);
rcu_read_lock();
@@ -1235,6 +1242,14 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
if (map)
seq_printf(seq, "map_id:\t%d\n", map->id);
rcu_read_unlock();
+
+ cgroup_lock();
+ if (st_link->cgroup)
+ cgrp_id = cgroup_id(st_link->cgroup);
+ cgroup_unlock();
+
+ if (cgrp_id)
+ seq_printf(seq, "cgroup_id:\t%llu\n", cgrp_id);
}
static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
@@ -1242,6 +1257,7 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
{
struct bpf_struct_ops_link *st_link;
struct bpf_map *map;
+ u64 cgrp_id = 0;
st_link = container_of(link, struct bpf_struct_ops_link, link);
rcu_read_lock();
@@ -1249,6 +1265,13 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
if (map)
info->struct_ops.map_id = map->id;
rcu_read_unlock();
+
+ cgroup_lock();
+ if (st_link->cgroup)
+ cgrp_id = cgroup_id(st_link->cgroup);
+ cgroup_unlock();
+
+ info->struct_ops.cgroup_id = cgrp_id;
return 0;
}
@@ -1327,6 +1350,9 @@ static int bpf_struct_ops_map_link_detach(struct bpf_link *link)
mutex_unlock(&update_mutex);
+ if (st_link->cgroup)
+ cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
+
wake_up_interruptible_poll(&st_link->wait_hup, EPOLLHUP);
return 0;
@@ -1339,6 +1365,9 @@ static __poll_t bpf_struct_ops_map_link_poll(struct file *file,
poll_wait(file, &st_link->wait_hup, pts);
+ if (st_link->cgroup_removed)
+ return EPOLLHUP;
+
return rcu_access_pointer(st_link->map) ? 0 : EPOLLHUP;
}
@@ -1357,8 +1386,12 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
struct bpf_link_primer link_primer;
struct bpf_struct_ops_map *st_map;
struct bpf_map *map;
+ struct cgroup *cgrp;
int err;
+ if (attr->link_create.flags & ~BPF_F_CGROUP_FD)
+ return -EINVAL;
+
map = bpf_map_get(attr->link_create.map_fd);
if (IS_ERR(map))
return PTR_ERR(map);
@@ -1378,11 +1411,26 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
attr->link_create.attach_type);
+ init_waitqueue_head(&link->wait_hup);
+
+ if (attr->link_create.flags & BPF_F_CGROUP_FD) {
+ cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
+ if (IS_ERR(cgrp)) {
+ err = PTR_ERR(cgrp);
+ goto err_out;
+ }
+ link->cgroup = cgrp;
+ err = cgroup_bpf_attach_struct_ops(cgrp, link);
+ if (err) {
+ cgroup_put(cgrp);
+ link->cgroup = NULL;
+ goto err_out;
+ }
+ }
+
err = bpf_link_prime(&link->link, &link_primer);
if (err)
- goto err_out;
-
- init_waitqueue_head(&link->wait_hup);
+ goto err_put_cgroup;
/* Hold the update_mutex such that the subsystem cannot
* do link->ops->detach() before the link is fully initialized.
@@ -1393,13 +1441,16 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
mutex_unlock(&update_mutex);
bpf_link_cleanup(&link_primer);
link = NULL;
- goto err_out;
+ goto err_put_cgroup;
}
RCU_INIT_POINTER(link->map, map);
mutex_unlock(&update_mutex);
return bpf_link_settle(&link_primer);
+err_put_cgroup:
+ if (link && link->cgroup)
+ cgroup_bpf_detach_struct_ops(link->cgroup, link);
err_out:
bpf_map_put(map);
kfree(link);
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 876f6a81a9b6..b593ebb30a4e 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -16,6 +16,7 @@
#include <linux/bpf-cgroup.h>
#include <linux/bpf_lsm.h>
#include <linux/bpf_verifier.h>
+#include <linux/poll.h>
#include <net/sock.h>
#include <net/bpf_sk_storage.h>
@@ -307,12 +308,23 @@ static void cgroup_bpf_release(struct work_struct *work)
bpf.release_work);
struct bpf_prog_array *old_array;
struct list_head *storages = &cgrp->bpf.storages;
+ struct bpf_struct_ops_link *st_link, *st_tmp;
struct bpf_cgroup_storage *storage, *stmp;
+ LIST_HEAD(st_links);
unsigned int atype;
cgroup_lock();
+ list_splice_init(&cgrp->bpf.struct_ops_links, &st_links);
+ list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
+ st_link->cgroup = NULL;
+ st_link->cgroup_removed = true;
+ cgroup_put(cgrp);
+ if (IS_ERR(bpf_link_inc_not_zero(&st_link->link)))
+ list_del(&st_link->list);
+ }
+
for (atype = 0; atype < ARRAY_SIZE(cgrp->bpf.progs); atype++) {
struct hlist_head *progs = &cgrp->bpf.progs[atype];
struct bpf_prog_list *pl;
@@ -346,6 +358,11 @@ static void cgroup_bpf_release(struct work_struct *work)
cgroup_unlock();
+ list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
+ st_link->link.ops->detach(&st_link->link);
+ bpf_link_put(&st_link->link);
+ }
+
for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
cgroup_bpf_put(p);
@@ -525,6 +542,7 @@ static int cgroup_bpf_inherit(struct cgroup *cgrp)
INIT_HLIST_HEAD(&cgrp->bpf.progs[i]);
INIT_LIST_HEAD(&cgrp->bpf.storages);
+ INIT_LIST_HEAD(&cgrp->bpf.struct_ops_links);
for (i = 0; i < NR; i++)
if (compute_effective_progs(cgrp, i, &arrays[i]))
@@ -2755,3 +2773,31 @@ cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return NULL;
}
}
+
+int cgroup_bpf_attach_struct_ops(struct cgroup *cgrp,
+ struct bpf_struct_ops_link *link)
+{
+ int ret = 0;
+
+ cgroup_lock();
+ if (percpu_ref_is_zero(&cgrp->bpf.refcnt)) {
+ ret = -EBUSY;
+ goto out;
+ }
+ list_add_tail(&link->list, &cgrp->bpf.struct_ops_links);
+out:
+ cgroup_unlock();
+ return ret;
+}
+
+void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
+ struct bpf_struct_ops_link *link)
+{
+ cgroup_lock();
+ if (link->cgroup == cgrp) {
+ list_del(&link->list);
+ link->cgroup = NULL;
+ cgroup_put(cgrp);
+ }
+ cgroup_unlock();
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 37142e6d911a..fa075dc3b7eb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1246,6 +1246,7 @@ enum bpf_perf_event_type {
#define BPF_F_AFTER (1U << 4)
#define BPF_F_ID (1U << 5)
#define BPF_F_PREORDER (1U << 6)
+#define BPF_F_CGROUP_FD (1U << 7)
#define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
/* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
--
2.43.0
* [RFC PATCH bpf-next v7 03/11] libbpf: fix return value on memory allocation failure
From: Hui Zhu @ 2026-05-26 2:20 UTC (permalink / raw)
Cc: geliang, baohua, Yafang Shao
From: Roman Gushchin <roman.gushchin@linux.dev>
bpf_map__attach_struct_ops() returns -EINVAL instead of -ENOMEM
when the allocation of the link fails. Fix it.
Fixes: 590a00888250 ("bpf: libbpf: Add STRUCT_OPS support")
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
---
tools/lib/bpf/libbpf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ab2071fdd3e8..1e8688975d16 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -13701,7 +13701,7 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
link = calloc(1, sizeof(*link));
if (!link)
- return libbpf_err_ptr(-EINVAL);
+ return libbpf_err_ptr(-ENOMEM);
/* kern_vdata should be prepared during the loading phase. */
err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
--
2.43.0
* [RFC PATCH bpf-next v7 04/11] libbpf: introduce bpf_map__attach_struct_ops_opts()
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
` (2 preceding siblings ...)
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 03/11] libbpf: fix return value on memory allocation failure Hui Zhu
@ 2026-05-26 2:20 ` Hui Zhu
2026-05-26 3:06 ` bot+bpf-ci
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 05/11] bpf: Pass flags in bpf_link_create for struct_ops Hui Zhu
` (6 subsequent siblings)
10 siblings, 1 reply; 18+ messages in thread
From: Hui Zhu @ 2026-05-26 2:20 UTC (permalink / raw)
Cc: geliang, baohua
From: Roman Gushchin <roman.gushchin@linux.dev>
Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops() that takes an additional struct
bpf_struct_ops_opts argument.
This allows passing a target_fd argument and the BPF_F_CGROUP_FD flag,
and thereby attaching the struct ops to a cgroup.
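The opts struct follows the size-prefixed convention: the caller records
the struct size in sz, and a field is read only if it lies within that
size. A minimal, self-contained sketch of that idea follows. The names
demo_opts, demo_opts_has(), demo_get_flags() and demo_get_target_fd() are
hypothetical stand-ins for illustration; they are not libbpf's
OPTS_VALID()/OPTS_GET() macros, and no attach call is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the layout proposed for struct bpf_struct_ops_opts. */
struct demo_opts {
	size_t sz;		/* size of the struct as seen by the caller */
	uint32_t flags;
	uint32_t target_fd;
	uint64_t expected_revision;
};

/* Offset of the first byte past 'field' inside struct demo_opts. */
#define DEMO_FIELD_END(field) \
	(offsetof(struct demo_opts, field) + \
	 sizeof(((struct demo_opts *)0)->field))

/* True if a field ending at 'end' lies within the caller-declared size. */
static int demo_opts_has(const struct demo_opts *opts, size_t end)
{
	return opts && opts->sz >= end;
}

/* Read flags, falling back to 0 when opts is NULL or too small. */
static uint32_t demo_get_flags(const struct demo_opts *opts)
{
	return demo_opts_has(opts, DEMO_FIELD_END(flags)) ? opts->flags : 0;
}

/* Read target_fd, falling back to 0 when opts is NULL or too small. */
static uint32_t demo_get_target_fd(const struct demo_opts *opts)
{
	return demo_opts_has(opts, DEMO_FIELD_END(target_fd)) ?
	       opts->target_fd : 0;
}
```

With this scheme a NULL opts pointer yields all defaults, which matches
how bpf_map__attach_struct_ops() forwards NULL to the new function.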
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
tools/lib/bpf/libbpf.c | 20 +++++++++++++++++---
tools/lib/bpf/libbpf.h | 14 ++++++++++++++
tools/lib/bpf/libbpf.map | 1 +
3 files changed, 32 insertions(+), 3 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 1e8688975d16..a1b54da1ded2 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -13683,11 +13683,18 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
return close(link->fd);
}
-struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
+ const struct bpf_struct_ops_opts *opts)
{
+ DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
struct bpf_link_struct_ops *link;
+ int err, fd, target_fd;
__u32 zero = 0;
- int err, fd;
+
+ if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
+ pr_warn("map '%s': invalid opts\n", map->name);
+ return libbpf_err_ptr(-EINVAL);
+ }
if (!bpf_map__is_struct_ops(map)) {
pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
@@ -13724,7 +13731,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
return &link->link;
}
- fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
+ link_opts.flags = OPTS_GET(opts, flags, 0);
+ target_fd = OPTS_GET(opts, target_fd, 0);
+ fd = bpf_link_create(map->fd, target_fd, BPF_STRUCT_OPS, &link_opts);
if (fd < 0) {
free(link);
return libbpf_err_ptr(fd);
@@ -13736,6 +13745,11 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
return &link->link;
}
+struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+{
+ return bpf_map__attach_struct_ops_opts(map, NULL);
+}
+
/*
* Swap the back struct_ops of a link with a new struct_ops map.
*/
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index bba4e8464396..18af178547ad 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -945,6 +945,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
struct bpf_map;
LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
+
+struct bpf_struct_ops_opts {
+ /* size of this struct, for forward/backward compatibility */
+ size_t sz;
+ __u32 flags;
+ __u32 target_fd;
+ __u64 expected_revision;
+ size_t :0;
+};
+#define bpf_struct_ops_opts__last_field expected_revision
+
+LIBBPF_API struct bpf_link *
+bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
+ const struct bpf_struct_ops_opts *opts);
LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
struct bpf_iter_attach_opts {
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index dfed8d60af05..6105619b5ecf 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -454,6 +454,7 @@ LIBBPF_1.7.0 {
bpf_prog_assoc_struct_ops;
bpf_program__assoc_struct_ops;
btf__permute;
+ bpf_map__attach_struct_ops_opts;
} LIBBPF_1.6.0;
LIBBPF_1.8.0 {
--
2.43.0
* [RFC PATCH bpf-next v7 05/11] bpf: Pass flags in bpf_link_create for struct_ops
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
` (3 preceding siblings ...)
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 04/11] libbpf: introduce bpf_map__attach_struct_ops_opts() Hui Zhu
@ 2026-05-26 2:20 ` Hui Zhu
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 06/11] mm: memcontrol: Add BPF struct_ops for memory controller Hui Zhu
` (5 subsequent siblings)
10 siblings, 0 replies; 18+ messages in thread
From: Hui Zhu @ 2026-05-26 2:20 UTC (permalink / raw)
Cc: geliang, baohua, Hui Zhu
From: Hui Zhu <zhuhui@kylinos.cn>
To support features like allowing overrides in cgroup hierarchies,
we need a way to pass flags from userspace to the kernel when
attaching a struct_ops.
Extend `bpf_struct_ops_link` to include a `flags` field. This field
is populated from `attr->link_create.flags` during link creation. This
will allow struct_ops implementations, such as the upcoming memory
controller ops, to interpret these flags and modify their attachment
behavior accordingly.
The flags validation in bpf_struct_ops_link_create() is updated
to explicitly permit BPF_F_ALLOW_OVERRIDE in addition to the
already-allowed BPF_F_CGROUP_FD. Any other flag combination
will still be rejected with -EINVAL.
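The check described above is a plain allow-mask test. A small standalone
sketch, assuming the flag values from the UAPI header and from patch 02
of this series; the DEMO_ names and demo_check_link_flags() are
hypothetical and only mirror the kernel check:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values as in the UAPI header; BPF_F_CGROUP_FD is added by patch 02. */
#define DEMO_BPF_F_ALLOW_OVERRIDE	(1U << 0)
#define DEMO_BPF_F_ALLOW_MULTI		(1U << 1)
#define DEMO_BPF_F_CGROUP_FD		(1U << 7)

/* Returns 0 if only permitted flags are set, -EINVAL otherwise. */
static int demo_check_link_flags(uint32_t flags)
{
	const uint32_t allowed = DEMO_BPF_F_CGROUP_FD |
				 DEMO_BPF_F_ALLOW_OVERRIDE;

	/* Any bit outside the allowed mask rejects the request. */
	if (flags & ~allowed)
		return -EINVAL;
	return 0;
}
```

For example, BPF_F_CGROUP_FD | BPF_F_ALLOW_OVERRIDE passes, while any
combination that includes BPF_F_ALLOW_MULTI is rejected.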
UAPI Change:
This patch updates the comment in include/uapi/linux/bpf.h to reflect
that the cgroup-bpf attach flags (such as BPF_F_ALLOW_OVERRIDE) are
now applicable to both BPF_PROG_ATTACH and BPF_LINK_CREATE commands.
Previously, these flags were only documented for BPF_PROG_ATTACH.
The actual flag definitions remain unchanged, so this is a compatible
extension of the existing API. Older userspace will continue to work
(by not passing flags), and newer userspace can opt in to the new
functionality by setting the appropriate flags.
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 2 +-
kernel/bpf/bpf_struct_ops.c | 4 +++-
tools/include/uapi/linux/bpf.h | 2 +-
4 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 743b4f0546b5..aae7f9837944 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1915,6 +1915,7 @@ struct bpf_struct_ops_link {
bool cgroup_removed;
struct list_head list;
wait_queue_head_t wait_hup;
+ u32 flags;
};
struct bpf_link_primer {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f547613986cc..85ab5bdf81ac 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1194,7 +1194,7 @@ enum bpf_perf_event_type {
BPF_PERF_EVENT_EVENT = 6,
};
-/* cgroup-bpf attach flags used in BPF_PROG_ATTACH command
+/* cgroup-bpf attach flags used in BPF_PROG_ATTACH and BPF_LINK_CREATE command
*
* NONE(default): No further bpf programs allowed in the subtree.
*
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 5333290957cb..1d15c667a300 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -1389,7 +1389,8 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
struct cgroup *cgrp;
int err;
- if (attr->link_create.flags & ~BPF_F_CGROUP_FD)
+ if (attr->link_create.flags & ~(BPF_F_CGROUP_FD |
+ BPF_F_ALLOW_OVERRIDE))
return -EINVAL;
map = bpf_map_get(attr->link_create.map_fd);
@@ -1427,6 +1428,7 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
goto err_out;
}
}
+ link->flags = attr->link_create.flags;
err = bpf_link_prime(&link->link, &link_primer);
if (err)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index fa075dc3b7eb..8a2b1f865d2b 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1194,7 +1194,7 @@ enum bpf_perf_event_type {
BPF_PERF_EVENT_EVENT = 6,
};
-/* cgroup-bpf attach flags used in BPF_PROG_ATTACH command
+/* cgroup-bpf attach flags used in BPF_PROG_ATTACH and BPF_LINK_CREATE command
*
* NONE(default): No further bpf programs allowed in the subtree.
*
--
2.43.0
* [RFC PATCH bpf-next v7 06/11] mm: memcontrol: Add BPF struct_ops for memory controller
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
` (4 preceding siblings ...)
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 05/11] bpf: Pass flags in bpf_link_create for struct_ops Hui Zhu
@ 2026-05-26 2:24 ` Hui Zhu
2026-05-26 3:19 ` bot+bpf-ci
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 07/11] mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc Hui Zhu
` (4 subsequent siblings)
10 siblings, 1 reply; 18+ messages in thread
From: Hui Zhu @ 2026-05-26 2:24 UTC (permalink / raw)
Cc: geliang, baohua, Hui Zhu
From: Hui Zhu <zhuhui@kylinos.cn>
Introduce BPF struct_ops support to the memory controller, enabling
custom and dynamic control over memory pressure via a new struct_ops
type, `memcg_bpf_ops`.
The `memcg_bpf_ops` interface exposes the following hooks:
- `memcg_charged`: Called on the synchronous blocking charge path after
pages have been charged to the cgroup. Returns a custom throttling
delay in milliseconds. This value is used as a lower bound for the
penalty passed to `__mem_cgroup_handle_over_high()` and applies even
when `memory.high` is not breached, allowing BPF programs to impose
proactive back-pressure on any charge event. Return 0 for no delay.
- `memcg_uncharged`: Called when pages are uncharged from a cgroup,
allowing BPF programs to track or react to memory releases.
- `below_low`: Overrides the `memory.low` protection check. Receives
the effective low threshold (elow) and current usage as arguments.
If it returns true, the cgroup is treated as protected regardless of
the standard elow >= usage comparison. Returning false continues
to the normal kernel check.
- `below_min`: Same as `below_low`, but for `memory.min` protection.
Receives emin and usage as arguments.
- `handle_cgroup_online`/`handle_cgroup_offline`: Callbacks invoked when
a cgroup with an attached program comes online or goes offline, allowing
BPF programs to manage per-cgroup state.
These hooks are integrated into core memory control logic.
`memcg_charged` is consulted in `try_charge_memcg` on the synchronous
blocking path. To avoid losing the originally charged cgroup pointer as
the charge loop walks up the ancestor chain, `orig_memcg` is saved
before the loop begins. After the loop, the BPF hook is called with
`orig_memcg` and the actual batch size, and its result (converted from
milliseconds to jiffies) is stored as `bpf_high_delay`.
`__mem_cgroup_handle_over_high()` is then invoked when either
`bpf_high_delay` is non-zero or `memcg_nr_pages_over_high` exceeds
MEMCG_CHARGE_BATCH. Inside the function, the current task's memcg is
obtained independently via `get_mem_cgroup_from_mm()`. Reclaim is
attempted first; if reclaim makes forward progress or retries remain,
the function loops back to reclaim again rather than throttling
immediately. `bpf_high_delay` serves as a lower bound for the final
penalty via `max(penalty_jiffies, bpf_high_delay)`: when
`memcg_nr_pages_over_high` is zero (memory.high not breached),
the kernel overage calculation is skipped and `bpf_high_delay` alone
sets the penalty. In all cases, throttling only occurs if the resulting
penalty exceeds HZ/100; a BPF-requested delay below this threshold
causes no sleep. The deferred user-return path (via
`mem_cgroup_handle_over_high()`) always passes bpf_high_delay=0 since
BPF delay is evaluated exactly once, on the synchronous charge path.
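The way the BPF delay combines with the kernel's own penalty can be
sketched in plain C. This is a simplified model, not the kernel code: it
assumes HZ=1000, uses a naive round-up conversion in place of
msecs_to_jiffies(), and leaves out reclaim, retries and any upper clamp
on the penalty. All demo_ names are hypothetical.

```c
#include <assert.h>

#define DEMO_HZ 1000UL	/* assumed tick rate; real kernels vary */

/* Simplified round-up conversion; not the kernel's msecs_to_jiffies(). */
static unsigned long demo_msecs_to_jiffies(unsigned int ms)
{
	return ((unsigned long)ms * DEMO_HZ + 999UL) / 1000UL;
}

/*
 * Returns the number of jiffies to sleep, or 0 for no throttling.
 * kernel_penalty is 0 when memory.high is not breached.
 */
static unsigned long demo_throttle_jiffies(unsigned long kernel_penalty,
					   unsigned long bpf_high_delay)
{
	unsigned long penalty = kernel_penalty;

	/* The BPF delay acts as a lower bound for the penalty. */
	if (bpf_high_delay > penalty)
		penalty = bpf_high_delay;

	/* Penalties of HZ/100 jiffies or less cause no sleep. */
	if (penalty <= DEMO_HZ / 100)
		return 0;
	return penalty;
}
```

Under these assumptions a BPF-requested delay of 10 ms is dropped, 11 ms
results in an 11-jiffy sleep, and a larger kernel penalty always wins
over a smaller BPF delay.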
`below_low` and `below_min` are inserted in their respective inline
functions after the unprotected check. The pre-read elow/emin and usage
values are forwarded to the BPF hook; on false return the standard
kernel comparison (elow >= usage) proceeds as normal.
Support for `BPF_F_ALLOW_OVERRIDE` is included. When a program is
registered with this flag, a descendant cgroup may later attach its own
`memcg_bpf_ops` to override the inherited program. Without this flag,
attaching to a cgroup that already has a program (whether attached
directly or inherited from an ancestor) will fail with -EBUSY.
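The attach rule can be modeled with a small standalone sketch that
follows the check done at the top of bpf_memcg_ops_reg(). struct
demo_memcg and demo_may_attach() are hypothetical; the additional scan
for conflicting programs in descendants is left out here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_F_ALLOW_OVERRIDE	(1U << 0)

struct demo_memcg {
	struct demo_memcg *parent;	/* NULL for the root */
	const void *ops;		/* NULL when no program is attached */
	uint32_t ops_flags;
};

/* Returns 0 if a new program may be attached to @cg, -EBUSY otherwise. */
static int demo_may_attach(const struct demo_memcg *cg)
{
	if (!cg->ops)
		return 0;
	/* The root cannot inherit, so its ops were attached directly. */
	if (!cg->parent)
		return -EBUSY;
	/* The existing program did not permit overrides. */
	if (!(cg->ops_flags & DEMO_F_ALLOW_OVERRIDE))
		return -EBUSY;
	/* Ops that differ from the parent's were attached directly to @cg. */
	if (cg->parent->ops != cg->ops)
		return -EBUSY;
	return 0;
}
```

In this model only a cgroup whose program is inherited from its parent,
and was registered with the override flag, accepts a second attach.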
On registration, ops are propagated to the cgroup itself and all its
descendants via `mem_cgroup_iter`. A `bpf_ops_flags` field is added to
`struct mem_cgroup` to persist the attachment flags, which are inherited
during `css_online` and restored to the parent's flags on
unregistration. On unregistration, rather than unconditionally clearing
`bpf_ops` to NULL throughout the subtree, each descendant that still
holds the unregistered ops pointer has its `bpf_ops` and
`bpf_ops_flags` restored to the values the registering cgroup's parent
held at that time. This correctly handles the override case where a
descendant had re-attached over an inherited program.
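A toy model of the propagate and restore behavior, written as standalone
C over a fixed four-node tree. It is a sketch based on the description
above, not the kernel implementation: demo_memcg, demo_reg() and
demo_unreg() are hypothetical names, locking is omitted, and "at that
time" is read as the parent's values at unregistration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR 4

struct demo_memcg {
	int parent;		/* index of the parent, -1 for the root */
	const void *ops;
	uint32_t flags;
};

/* True if @idx is @top or one of its descendants. */
static int demo_in_subtree(const struct demo_memcg *cg, int idx, int top)
{
	for (; idx >= 0; idx = cg[idx].parent)
		if (idx == top)
			return 1;
	return 0;
}

/* Registration: install @ops and @flags on @top and every descendant. */
static void demo_reg(struct demo_memcg *cg, int top, const void *ops,
		     uint32_t flags)
{
	for (int i = 0; i < DEMO_NR; i++)
		if (demo_in_subtree(cg, i, top)) {
			cg[i].ops = ops;
			cg[i].flags = flags;
		}
}

/*
 * Unregistration: nodes in the subtree that still hold @ops fall back to
 * the values held by @top's parent; nodes that were overridden are kept.
 */
static void demo_unreg(struct demo_memcg *cg, int top, const void *ops)
{
	int p = cg[top].parent;
	const void *parent_ops = p >= 0 ? cg[p].ops : NULL;
	uint32_t parent_flags = p >= 0 ? cg[p].flags : 0;

	for (int i = 0; i < DEMO_NR; i++)
		if (demo_in_subtree(cg, i, top) && cg[i].ops == ops) {
			cg[i].ops = parent_ops;
			cg[i].flags = parent_flags;
		}
}
```

Clearing the whole subtree to NULL instead would drop a program that a
descendant had attached as an override, which is the case the restore
logic is meant to preserve.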
Lifecycle management ensures programs are inherited by child cgroups
on `css_online` and cleaned up on `css_offline`. SRCU (`memcg_bpf_srcu`)
protects concurrent read access to the `memcg->bpf_ops` pointer; all
writes are serialized under `cgroup_mutex`.
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
include/linux/memcontrol.h | 250 ++++++++++++++++++++++++++++++-
mm/bpf_memcontrol.c | 298 ++++++++++++++++++++++++++++++++++++-
mm/memcontrol.c | 43 ++++--
3 files changed, 574 insertions(+), 17 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b..30b7b8558ccb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,6 +23,7 @@
#include <linux/writeback.h>
#include <linux/page-flags.h>
#include <linux/shrinker.h>
+#include <linux/srcu.h>
struct mem_cgroup;
struct obj_cgroup;
@@ -192,6 +193,59 @@ struct obj_cgroup {
bool is_root;
};
+#ifdef CONFIG_BPF_SYSCALL
+/*
+ * struct memcg_bpf_ops - BPF callbacks for memory cgroup operations
+ *
+ * @handle_cgroup_online: Called when a cgroup comes online. May be used
+ * by a BPF program to initialize per-cgroup state.
+ * @handle_cgroup_offline: Called when a cgroup goes offline. May be used
+ * to release per-cgroup state allocated in the
+ * online callback.
+ * @below_low: Override the memory.low protection check.
+ * Receives the effective low threshold @elow and the current
+ * memory usage @usage (both in pages). If the callback returns
+ * true, mem_cgroup_below_low() returns true immediately,
+ * treating the cgroup as protected regardless of the standard
+ * elow >= usage comparison. Returning false continues to
+ * the normal kernel check.
+ * @below_min: Same as @below_low, but for the memory.min protection check.
+ * Receives @emin and @usage. Returning true short-circuits the
+ * standard emin >= usage comparison.
+ * @memcg_charged: Called on the synchronous blocking charge path after
+ * pages have been charged to the cgroup. Returns a custom
+ * throttle delay in milliseconds. This delay is taken as
+ * a lower bound for the penalty in
+ * __mem_cgroup_handle_over_high() and applies even when
+ * memory.high is not breached. Return 0 for no extra delay.
+ * @memcg_uncharged: Called when pages are uncharged from the cgroup.
+ * Allows BPF programs to track memory releases or update
+ * accounting state. No return value.
+ *
+ * This structure defines the interface for BPF programs to customize
+ * memory cgroup behavior through struct_ops programs. All callbacks are
+ * non-sleepable. Concurrent readers are protected by SRCU
+ * (memcg_bpf_srcu); writers hold cgroup_mutex.
+ */
+struct memcg_bpf_ops {
+ void (*handle_cgroup_online)(struct mem_cgroup *memcg);
+
+ void (*handle_cgroup_offline)(struct mem_cgroup *memcg);
+
+ bool (*below_low)(struct mem_cgroup *memcg, unsigned long elow,
+ unsigned long usage);
+
+ bool (*below_min)(struct mem_cgroup *memcg, unsigned long emin,
+ unsigned long usage);
+
+ unsigned int (*memcg_charged)(struct mem_cgroup *memcg,
+ unsigned int nr_pages);
+
+ void (*memcg_uncharged)(struct mem_cgroup *memcg,
+ unsigned int nr_pages);
+};
+#endif /* CONFIG_BPF_SYSCALL */
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -323,6 +377,11 @@ struct mem_cgroup {
spinlock_t event_list_lock;
#endif /* CONFIG_MEMCG_V1 */
+#ifdef CONFIG_BPF_SYSCALL
+ struct memcg_bpf_ops *bpf_ops;
+ u32 bpf_ops_flags;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[];
};
@@ -533,6 +592,165 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
+#ifdef CONFIG_BPF_SYSCALL
+
+/* SRCU for protecting concurrent access to memcg->bpf_ops */
+extern struct srcu_struct memcg_bpf_srcu;
+
+/*
+ * BPF_MEMCG_CALL - Safely invoke a BPF memcg callback with return value
+ * @memcg: The memory cgroup whose bpf_ops to invoke
+ * @op: The callback name (struct member of memcg_bpf_ops)
+ * @default_val: Value to return if no BPF program is attached or the
+ * specific callback is not implemented
+ * @...: Additional arguments forwarded to the callback
+ *
+ * Uses a two-phase READ_ONCE() pattern:
+ * 1. An initial lockless READ_ONCE() provides a fast-path check.
+ * If bpf_ops is NULL the SRCU lock is never taken, keeping the
+ * common no-BPF path free of synchronization overhead.
+ * 2. A second READ_ONCE() after srcu_read_lock() ensures a consistent
+ * view of the pointer under the SRCU read section, guarding against
+ * a concurrent bpf_memcg_ops_unreg() that may be in progress.
+ */
+#define BPF_MEMCG_CALL(memcg, op, default_val, ...) ({ \
+ typeof(default_val) __ret = (default_val); \
+ struct memcg_bpf_ops *__ops; \
+ int __idx; \
+ \
+ if (unlikely(READ_ONCE((memcg)->bpf_ops))) { \
+ __idx = srcu_read_lock(&memcg_bpf_srcu); \
+ __ops = READ_ONCE((memcg)->bpf_ops); \
+ if (__ops && __ops->op) \
+ __ret = __ops->op(memcg, ##__VA_ARGS__);\
+ srcu_read_unlock(&memcg_bpf_srcu, __idx); \
+ } \
+ __ret; \
+})
+
+/*
+ * BPF_MEMCG_CALL_VOID - Safely invoke a void BPF memcg callback
+ * @memcg: The memory cgroup whose bpf_ops to invoke
+ * @op: The callback name (struct member of memcg_bpf_ops)
+ * @...: Additional arguments forwarded to the callback
+ *
+ * Same SRCU fast-path pattern as BPF_MEMCG_CALL but for callbacks
+ * that have no return value.
+ */
+#define BPF_MEMCG_CALL_VOID(memcg, op, ...) do { \
+ struct memcg_bpf_ops *__ops; \
+ int __idx; \
+ \
+ if (unlikely(READ_ONCE((memcg)->bpf_ops))) { \
+ __idx = srcu_read_lock(&memcg_bpf_srcu); \
+ __ops = READ_ONCE((memcg)->bpf_ops); \
+ if (__ops && __ops->op) \
+ __ops->op(memcg, ##__VA_ARGS__); \
+ srcu_read_unlock(&memcg_bpf_srcu, __idx); \
+ } \
+} while (0)
+
+static inline bool
+bpf_memcg_below_low(struct mem_cgroup *memcg, unsigned long elow,
+ unsigned long usage)
+{
+ return BPF_MEMCG_CALL(memcg, below_low, false, elow, usage);
+}
+
+static inline bool
+bpf_memcg_below_min(struct mem_cgroup *memcg, unsigned long emin,
+ unsigned long usage)
+{
+ return BPF_MEMCG_CALL(memcg, below_min, false, emin, usage);
+}
+
+static inline unsigned long
+bpf_memcg_charged(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+ unsigned int ret;
+
+ /*
+ * Retrieve the BPF-specified throttle delay in milliseconds and
+ * convert to jiffies for use in __mem_cgroup_handle_over_high().
+ */
+ ret = BPF_MEMCG_CALL(memcg, memcg_charged, 0U, nr_pages);
+ return msecs_to_jiffies(ret);
+}
+
+static inline void
+bpf_memcg_uncharged(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+ BPF_MEMCG_CALL_VOID(memcg, memcg_uncharged, nr_pages);
+}
+
+#undef BPF_MEMCG_CALL
+#undef BPF_MEMCG_CALL_VOID
+
+/*
+ * memcontrol_bpf_online - Inherit BPF ops for a newly online cgroup.
+ * @memcg: The memory cgroup coming online.
+ *
+ * Called under cgroup_mutex from mem_cgroup_css_online(). Inherits the
+ * parent's bpf_ops pointer and bpf_ops_flags into @memcg so that
+ * BPF-based memory control policies propagate down the hierarchy
+ * automatically.
+ *
+ * If the parent has no bpf_ops, this is a no-op. If it does, the ops
+ * pointer is copied and, if an online handler is implemented, it is
+ * invoked to allow the BPF program to initialize per-cgroup state for
+ * the new child.
+ *
+ * Locking: cgroup_mutex is held by the caller. Because bpf_memcg_ops_reg()
+ * and bpf_memcg_ops_unreg() also hold cgroup_mutex when writing
+ * memcg->bpf_ops, no additional lock on memcg_bpf_srcu is required here.
+ */
+extern void memcontrol_bpf_online(struct mem_cgroup *memcg);
+
+/*
+ * memcontrol_bpf_offline - Run BPF cleanup for a cgroup going offline.
+ * @memcg: The memory cgroup going offline.
+ *
+ * Called under cgroup_mutex from mem_cgroup_css_offline(). If a BPF
+ * program is attached and implements a handle_cgroup_offline callback,
+ * it is invoked so the program can release any per-cgroup state before
+ * the memcg is freed.
+ *
+ * Locking: same as memcontrol_bpf_online() — cgroup_mutex is held.
+ */
+extern void memcontrol_bpf_offline(struct mem_cgroup *memcg);
+
+#else /* CONFIG_BPF_SYSCALL */
+
+static inline unsigned long
+bpf_memcg_charged(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+ return 0;
+}
+
+static inline void
+bpf_memcg_uncharged(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+}
+
+static inline bool
+bpf_memcg_below_low(struct mem_cgroup *memcg, unsigned long elow,
+ unsigned long usage)
+{
+ return false;
+}
+
+static inline bool
+bpf_memcg_below_min(struct mem_cgroup *memcg, unsigned long emin,
+ unsigned long usage)
+{
+ return false;
+}
+
+static inline void memcontrol_bpf_online(struct mem_cgroup *memcg) { }
+static inline void memcontrol_bpf_offline(struct mem_cgroup *memcg) { }
+
+#endif /* CONFIG_BPF_SYSCALL */
+
static inline void mem_cgroup_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg,
unsigned long *min,
@@ -603,21 +821,35 @@ static inline bool mem_cgroup_unprotected(struct mem_cgroup *target,
static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
struct mem_cgroup *memcg)
{
+ unsigned long elow, usage;
+
if (mem_cgroup_unprotected(target, memcg))
return false;
- return READ_ONCE(memcg->memory.elow) >=
- page_counter_read(&memcg->memory);
+ elow = READ_ONCE(memcg->memory.elow);
+ usage = page_counter_read(&memcg->memory);
+
+ if (bpf_memcg_below_low(memcg, elow, usage))
+ return true;
+
+ return elow >= usage;
}
static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
struct mem_cgroup *memcg)
{
+ unsigned long emin, usage;
+
if (mem_cgroup_unprotected(target, memcg))
return false;
- return READ_ONCE(memcg->memory.emin) >=
- page_counter_read(&memcg->memory);
+ emin = READ_ONCE(memcg->memory.emin);
+ usage = page_counter_read(&memcg->memory);
+
+ if (bpf_memcg_below_min(memcg, emin, usage))
+ return true;
+
+ return emin >= usage;
}
int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp);
@@ -890,12 +1122,18 @@ unsigned long mem_cgroup_get_zone_lru_size(struct lruvec *lruvec,
return READ_ONCE(mz->lru_zone_size[zone_idx][lru]);
}
-void __mem_cgroup_handle_over_high(gfp_t gfp_mask);
+void __mem_cgroup_handle_over_high(gfp_t gfp_mask,
+ unsigned long bpf_high_delay);
static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask)
{
if (unlikely(current->memcg_nr_pages_over_high))
- __mem_cgroup_handle_over_high(gfp_mask);
+ /*
+ * Deferred user-return path: no BPF delay lookup here.
+ * BPF-provided delay is injected from try_charge_memcg()
+ * on the synchronous blocking charge path.
+ */
+ __mem_cgroup_handle_over_high(gfp_mask, 0);
}
unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 716df49d7647..1f726a7b22e3 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -8,6 +8,9 @@
#include <linux/memcontrol.h>
#include <linux/bpf.h>
+/* Protects readers of memcg->bpf_ops; writers hold cgroup_mutex. */
+DEFINE_SRCU(memcg_bpf_srcu);
+
__bpf_kfunc_start_defs();
/**
@@ -179,15 +182,306 @@ static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
.set = &bpf_memcontrol_kfuncs,
};
+/**
+ * memcontrol_bpf_online - Inherit BPF programs for a new online cgroup.
+ * @memcg: The memory cgroup that is coming online.
+ *
+ * When a new memcg is brought online, it inherits the BPF programs
+ * attached to its parent. This ensures consistent BPF-based memory
+ * control policies throughout the cgroup hierarchy.
+ *
+ * After inheriting, if the BPF program has an online handler, it is
+ * invoked for the new memcg.
+ */
+void memcontrol_bpf_online(struct mem_cgroup *memcg)
+{
+ struct memcg_bpf_ops *ops;
+ struct mem_cgroup *parent_memcg;
+
+ /* The root cgroup does not inherit from a parent. */
+ if (mem_cgroup_is_root(memcg))
+ return;
+
+ /*
+ * The only other writers of memcg->bpf_ops and
+ * memcg->bpf_ops_flags, bpf_memcg_ops_reg() and
+ * bpf_memcg_ops_unreg(), hold cgroup_mutex. The caller holds it
+ * too (asserted below), so both fields can be read and written
+ * here without entering an SRCU read-side critical section on
+ * memcg_bpf_srcu.
+ */
+ lockdep_assert_held(&cgroup_mutex);
+
+ parent_memcg = parent_mem_cgroup(memcg);
+
+ /* Inherit the BPF program from the parent cgroup. */
+ ops = READ_ONCE(parent_memcg->bpf_ops);
+ if (!ops)
+ return;
+ WRITE_ONCE(memcg->bpf_ops, ops);
+ memcg->bpf_ops_flags = parent_memcg->bpf_ops_flags;
+
+ /*
+ * If the BPF program implements it, call the online handler to
+ * allow the program to perform setup tasks for the new cgroup.
+ */
+ if (ops->handle_cgroup_online)
+ ops->handle_cgroup_online(memcg);
+}
+
+/**
+ * memcontrol_bpf_offline - Run BPF cleanup for an offline cgroup.
+ * @memcg: The memory cgroup that is going offline.
+ *
+ * If a BPF program is attached and implements an offline handler,
+ * it is invoked to perform cleanup tasks before the memcg goes
+ * completely offline.
+ */
+void memcontrol_bpf_offline(struct mem_cgroup *memcg)
+{
+ struct memcg_bpf_ops *ops;
+
+ /* Same locking rules as memcontrol_bpf_online(). */
+ lockdep_assert_held(&cgroup_mutex);
+
+ ops = READ_ONCE(memcg->bpf_ops);
+ if (!ops || !ops->handle_cgroup_offline)
+ return;
+
+ ops->handle_cgroup_offline(memcg);
+}
+
+static int memcg_ops_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg,
+ int off, int size)
+{
+ return -EACCES;
+}
+
+static bool memcg_ops_is_valid_access(int off, int size, enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+const struct bpf_verifier_ops bpf_memcg_verifier_ops = {
+ .get_func_proto = bpf_base_func_proto,
+ .btf_struct_access = memcg_ops_btf_struct_access,
+ .is_valid_access = memcg_ops_is_valid_access,
+};
+
+static void cfi_handle_cgroup_online(struct mem_cgroup *memcg)
+{
+}
+
+static void cfi_handle_cgroup_offline(struct mem_cgroup *memcg)
+{
+}
+
+static bool
+cfi_below_low(struct mem_cgroup *memcg, unsigned long elow,
+ unsigned long usage)
+{
+ return false;
+}
+
+static bool
+cfi_below_min(struct mem_cgroup *memcg, unsigned long emin,
+ unsigned long usage)
+{
+ return false;
+}
+
+static unsigned int cfi_memcg_charged(struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ return 0;
+}
+
+static void cfi_memcg_uncharged(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+}
+
+static struct memcg_bpf_ops cfi_bpf_memcg_ops = {
+ .handle_cgroup_online = cfi_handle_cgroup_online,
+ .handle_cgroup_offline = cfi_handle_cgroup_offline,
+ .below_low = cfi_below_low,
+ .below_min = cfi_below_min,
+ .memcg_charged = cfi_memcg_charged,
+ .memcg_uncharged = cfi_memcg_uncharged,
+};
+
+static int bpf_memcg_ops_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_memcg_ops_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct memcg_bpf_ops, handle_cgroup_online):
+ case offsetof(struct memcg_bpf_ops, handle_cgroup_offline):
+ case offsetof(struct memcg_bpf_ops, below_low):
+ case offsetof(struct memcg_bpf_ops, below_min):
+ case offsetof(struct memcg_bpf_ops, memcg_charged):
+ case offsetof(struct memcg_bpf_ops, memcg_uncharged):
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (prog->sleepable)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int bpf_memcg_ops_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ return 0;
+}
+
+static int bpf_memcg_ops_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_struct_ops_link *ops_link;
+ struct memcg_bpf_ops *ops = kdata, *old_ops;
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *memcg, *iter;
+ int err = 0;
+
+ if (!link)
+ return -EOPNOTSUPP;
+ ops_link = container_of(link, struct bpf_struct_ops_link, link);
+ if (!ops_link->cgroup)
+ return -EINVAL;
+
+ cgroup_lock();
+
+ css = cgroup_e_css(ops_link->cgroup, &memory_cgrp_subsys);
+ if (!css) {
+ err = -EINVAL;
+ goto unlock_out;
+ }
+ memcg = mem_cgroup_from_css(css);
+
+ /*
+ * Check if memcg has bpf_ops and whether it is inherited from
+ * parent.
+ * If inherited and BPF_F_ALLOW_OVERRIDE is set, allow override.
+ */
+ old_ops = READ_ONCE(memcg->bpf_ops);
+ if (old_ops) {
+ struct mem_cgroup *parent_memcg = parent_mem_cgroup(memcg);
+
+ if (!parent_memcg ||
+ !(memcg->bpf_ops_flags & BPF_F_ALLOW_OVERRIDE) ||
+ READ_ONCE(parent_memcg->bpf_ops) != old_ops) {
+ err = -EBUSY;
+ goto unlock_out;
+ }
+ }
+
+ /* Check for incompatible bpf_ops in descendants. */
+ iter = NULL;
+ while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
+ struct memcg_bpf_ops *iter_ops = READ_ONCE(iter->bpf_ops);
+
+ if (iter_ops && iter_ops != old_ops) {
+ /* cannot override existing bpf_ops of sub-cgroup. */
+ mem_cgroup_iter_break(memcg, iter);
+ err = -EBUSY;
+ goto unlock_out;
+ }
+ }
+
+ iter = NULL;
+ while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
+ WRITE_ONCE(iter->bpf_ops, ops);
+ iter->bpf_ops_flags = ops_link->flags;
+ }
+
+unlock_out:
+ cgroup_unlock();
+ return err;
+}
+
+/* Unregister the struct ops instance */
+static void bpf_memcg_ops_unreg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_struct_ops_link *ops_link;
+ struct memcg_bpf_ops *ops = kdata;
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *memcg;
+ struct mem_cgroup *iter;
+ struct memcg_bpf_ops *parent_bpf_ops = NULL;
+ u32 parent_bpf_ops_flags = 0;
+
+ if (!link)
+ return;
+ ops_link = container_of(link, struct bpf_struct_ops_link, link);
+ if (!ops_link->cgroup)
+ return;
+
+ cgroup_lock();
+
+ css = cgroup_e_css(ops_link->cgroup, &memory_cgrp_subsys);
+ if (!css)
+ goto unlock_out;
+ memcg = mem_cgroup_from_css(css);
+
+ /* Get the parent bpf_ops and bpf_ops_flags */
+ iter = parent_mem_cgroup(memcg);
+ if (iter) {
+ parent_bpf_ops = READ_ONCE(iter->bpf_ops);
+ parent_bpf_ops_flags = iter->bpf_ops_flags;
+ }
+
+ iter = NULL;
+ while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
+ if (READ_ONCE(iter->bpf_ops) == ops) {
+ WRITE_ONCE(iter->bpf_ops, parent_bpf_ops);
+ iter->bpf_ops_flags = parent_bpf_ops_flags;
+ }
+ }
+
+unlock_out:
+ cgroup_unlock();
+ synchronize_srcu(&memcg_bpf_srcu);
+}
+
+static struct bpf_struct_ops bpf_memcg_bpf_ops = {
+ .verifier_ops = &bpf_memcg_verifier_ops,
+ .init = bpf_memcg_ops_init,
+ .check_member = bpf_memcg_ops_check_member,
+ .init_member = bpf_memcg_ops_init_member,
+ .reg = bpf_memcg_ops_reg,
+ .unreg = bpf_memcg_ops_unreg,
+ .name = "memcg_bpf_ops",
+ .owner = THIS_MODULE,
+ .cfi_stubs = &cfi_bpf_memcg_ops,
+};
+
static int __init bpf_memcontrol_init(void)
{
- int err;
+ int err, err2;
err = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
&bpf_memcontrol_kfunc_set);
if (err)
pr_warn("error while registering bpf memcontrol kfuncs: %d", err);
- return err;
+ err2 = register_bpf_struct_ops(&bpf_memcg_bpf_ops, memcg_bpf_ops);
+ if (err2)
+ pr_warn("error while registering memcontrol bpf ops: %d\n",
+ err2);
+
+ return err ? err : err2;
}
late_initcall(bpf_memcontrol_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c03d4787d466..ec912d19ef87 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2085,6 +2085,8 @@ static void memcg_uncharge(struct mem_cgroup *memcg, unsigned int nr_pages)
page_counter_uncharge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
+
+ bpf_memcg_uncharged(memcg, nr_pages);
}
/*
@@ -2473,8 +2475,12 @@ static unsigned long calculate_high_delay(struct mem_cgroup *memcg,
* Reclaims memory over the high limit. Called directly from
* try_charge() (context permitting), as well as from the userland
* return path where reclaim is always able to block.
+ *
+ * @bpf_high_delay is caller-provided extra delay. Callers that do
+ * not evaluate BPF delay (e.g. deferred return-path handling) pass 0.
*/
-void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
+void
+__mem_cgroup_handle_over_high(gfp_t gfp_mask, unsigned long bpf_high_delay)
{
unsigned long penalty_jiffies;
unsigned long pflags;
@@ -2516,11 +2522,15 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
* memory.high is breached and reclaim is unable to keep up. Throttle
* allocators proactively to slow down excessive growth.
*/
- penalty_jiffies = calculate_high_delay(memcg, nr_pages,
- mem_find_max_overage(memcg));
+ if (nr_pages) {
+ penalty_jiffies = calculate_high_delay(
+ memcg, nr_pages, mem_find_max_overage(memcg));
- penalty_jiffies += calculate_high_delay(memcg, nr_pages,
- swap_find_max_overage(memcg));
+ penalty_jiffies += calculate_high_delay(
+ memcg, nr_pages, swap_find_max_overage(memcg));
+ } else
+ penalty_jiffies = 0;
+ penalty_jiffies = max(penalty_jiffies, bpf_high_delay);
/*
* Clamp the max delay per usermode return so as to still keep the
@@ -2578,6 +2588,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool raised_max_event = false;
unsigned long pflags;
bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
+ struct mem_cgroup *orig_memcg;
+ unsigned long bpf_high_delay;
retry:
if (consume_stock(memcg, nr_pages))
@@ -2704,6 +2716,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (batch > nr_pages)
refill_stock(memcg, batch - nr_pages);
+ orig_memcg = memcg;
/*
* If the hierarchy is above the normal consumption range, schedule
* reclaim on returning to userland. We can perform reclaim here
@@ -2746,6 +2759,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
}
} while ((memcg = parent_mem_cgroup(memcg)));
+ bpf_high_delay = bpf_memcg_charged(orig_memcg, batch);
+
/*
* Reclaim is set up above to be called from the userland
* return path. But also attempt synchronous reclaim to avoid
@@ -2753,10 +2768,17 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
* kernel. If this is successful, the return path will see it
* when it rechecks the overage and simply bail out.
*/
- if (current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH &&
- !(current->flags & PF_MEMALLOC) &&
- gfpflags_allow_blocking(gfp_mask))
- __mem_cgroup_handle_over_high(gfp_mask);
+ if (!(current->flags & PF_MEMALLOC) &&
+ gfpflags_allow_blocking(gfp_mask)) {
+ /*
+ * BPF high-delay is evaluated only on the synchronous
+ * blocking path. The deferred user-return path calls
+ * __mem_cgroup_handle_over_high() with bpf_high_delay == 0.
+ */
+ if (bpf_high_delay ||
+ current->memcg_nr_pages_over_high > MEMCG_CHARGE_BATCH)
+ __mem_cgroup_handle_over_high(gfp_mask, bpf_high_delay);
+ }
return 0;
}
@@ -4151,6 +4173,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
*/
xa_store(&mem_cgroup_private_ids, memcg->id.id, memcg, GFP_KERNEL);
+ memcontrol_bpf_online(memcg);
+
return 0;
free_objcg:
for_each_node(nid) {
@@ -4188,6 +4212,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
zswap_memcg_offline_cleanup(memcg);
+ memcontrol_bpf_offline(memcg);
memcg_offline_kmem(memcg);
reparent_deferred_split_queue(memcg);
/*
--
2.43.0
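
The override check at the top of bpf_memcg_ops_reg() above can be modelled
in plain userspace C. This is only a sketch for reasoning about the rule,
not kernel code: struct model_memcg, MODEL_F_ALLOW_OVERRIDE and
model_can_attach() are hypothetical names, and the later walk over
descendants is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's BPF_F_ALLOW_OVERRIDE flag. */
#define MODEL_F_ALLOW_OVERRIDE (1U << 0)

/* Hypothetical, reduced model of a memcg: only the fields the check reads. */
struct model_memcg {
	const struct model_memcg *parent; /* NULL for the root */
	const void *bpf_ops;              /* NULL when nothing is attached */
	unsigned int bpf_ops_flags;
};

/*
 * Returns 0 if a new ops instance may be attached to @memcg, or -EBUSY if
 * the currently installed ops must not be replaced.
 */
static int model_can_attach(const struct model_memcg *memcg)
{
	assert(memcg != NULL);

	if (!memcg->bpf_ops)
		return 0;      /* nothing attached yet */
	if (!memcg->parent)
		return -EBUSY; /* the root has nobody to inherit from */
	if (!(memcg->bpf_ops_flags & MODEL_F_ALLOW_OVERRIDE))
		return -EBUSY; /* existing attachment forbids override */
	if (memcg->parent->bpf_ops != memcg->bpf_ops)
		return -EBUSY; /* ops were attached here, not inherited */
	return 0;
}
```

In this model an attachment is replaceable only when it was inherited from
the parent and the inherited flags allow it; ops attached directly to the
cgroup have to be detached first.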
* [RFC PATCH bpf-next v7 07/11] mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
` (5 preceding siblings ...)
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 06/11] mm: memcontrol: Add BPF struct_ops for memory controller Hui Zhu
@ 2026-05-26 2:24 ` Hui Zhu
2026-05-26 3:06 ` bot+bpf-ci
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 08/11] selftests/bpf: Add tests for memcg_bpf_ops Hui Zhu
` (3 subsequent siblings)
10 siblings, 1 reply; 18+ messages in thread
From: Hui Zhu @ 2026-05-26 2:24 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, JP Kobryn, Andrew Morton, Shuah Khan, davem,
Jakub Kicinski, Jesper Dangaard Brouer, Stanislav Fomichev,
KP Singh, Tao Chen, Mykyta Yatsenko, Leon Hwang, Anton Protopopov,
Amery Hung, Tobias Klauser, Eyal Birger, Rong Tao, Hao Luo,
Peter Zijlstra, Miguel Ojeda, Nathan Chancellor, Kees Cook,
Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr, Christian Brauner,
Randy Dunlap, Brian Gerst, Masahiro Yamada, Willem de Bruijn,
Jason Xing, Paul Chaignon, Chen Ridong, Lance Yang, Jiayuan Chen,
linux-kernel, bpf, cgroups, linux-mm, netdev, linux-kselftest
Cc: geliang, baohua, Hui Zhu
From: Hui Zhu <zhuhui@kylinos.cn>
Expose the memory cgroup reclaim interface to BPF programs by adding
the bpf_try_to_free_mem_cgroup_pages kfunc. This allows BPF to
trigger memory reclamation for a specific cgroup.
The kfunc wraps try_to_free_mem_cgroup_pages and introduces a
swappiness parameter with the following semantics:
Values in [MIN_SWAPPINESS, SWAPPINESS_ANON_ONLY] are passed through
as an explicit swappiness override.
Values below MIN_SWAPPINESS indicate the use of the system default
(passed as NULL to the core reclaim path).
Values above SWAPPINESS_ANON_ONLY are rejected as invalid (-EINVAL).
Note that the swappiness override is only respected by the core
reclaim path if the MEMCG_RECLAIM_PROACTIVE flag is set in
reclaim_options.
Swap usage during reclaim is gated on reclaim_options: swap is
considered only when MEMCG_RECLAIM_MAY_SWAP is set. Without this
flag, reclaim is restricted to file-backed pages regardless of the
swappiness value or the cgroup's swappiness setting.
Also include <linux/swap.h> for the swappiness macro definitions and
register the function with the KF_SLEEPABLE flag.
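
The three swappiness ranges described above can be sketched as a small
userspace classifier. The macro values are an assumption copied from
<linux/swap.h> at the time of writing and should be checked against the
target tree; enum swappiness_kind and classify_swappiness() are
hypothetical names used only for illustration.

```c
#include <assert.h>

/* Assumed values mirroring <linux/swap.h>; verify against the tree. */
#define MIN_SWAPPINESS 0
#define MAX_SWAPPINESS 200
#define SWAPPINESS_ANON_ONLY (MAX_SWAPPINESS + 1)

enum swappiness_kind {
	SWAPPINESS_DEFAULT,  /* pass NULL to the core reclaim path */
	SWAPPINESS_OVERRIDE, /* pass a pointer to the value */
	SWAPPINESS_INVALID,  /* reject with -EINVAL */
};

/* Same comparison order as the kfunc: upper bound first, then lower. */
static enum swappiness_kind classify_swappiness(int swappiness)
{
	if (swappiness > SWAPPINESS_ANON_ONLY)
		return SWAPPINESS_INVALID;
	if (swappiness < MIN_SWAPPINESS)
		return SWAPPINESS_DEFAULT;
	return SWAPPINESS_OVERRIDE;
}

/* Usage: any negative value works as the "use the default" sentinel. */
static void classify_swappiness_example(void)
{
	assert(classify_swappiness(-1) == SWAPPINESS_DEFAULT);
	assert(classify_swappiness(60) == SWAPPINESS_OVERRIDE);
}
```

Under these assumed values, 0 through 201 are explicit overrides, and 201
(SWAPPINESS_ANON_ONLY) is the largest accepted argument.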
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
mm/bpf_memcontrol.c | 57 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 57 insertions(+)
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 1f726a7b22e3..0353c8736aa5 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -6,6 +6,7 @@
*/
#include <linux/memcontrol.h>
+#include <linux/swap.h>
#include <linux/bpf.h>
/* Protects memcg->bpf_ops pointer for read and write. */
@@ -162,6 +163,60 @@ __bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
mem_cgroup_flush_stats(memcg);
}
+/**
+ * bpf_try_to_free_mem_cgroup_pages - attempt to reclaim pages from
+ * a memory cgroup
+ * @memcg: the target memory cgroup to reclaim from
+ * @nr_pages: the number of pages to reclaim
+ * @gfp_mask: GFP flags controlling the reclaim behavior
+ * @reclaim_options: bitmask of MEMCG_RECLAIM_* flags to tune
+ * reclaim strategy
+ * @swappiness: swappiness override value, or a sentinel to use
+ * the default
+ *
+ * BPF-facing wrapper around try_to_free_mem_cgroup_pages() that
+ * validates and translates the @swappiness argument before
+ * delegating to the core reclaim path.
+ *
+ * The @swappiness parameter follows these semantics:
+ * - Values in [MIN_SWAPPINESS, SWAPPINESS_ANON_ONLY] are passed
+ * through as an explicit swappiness override.
+ * - Values below MIN_SWAPPINESS are treated as "use the system
+ * default"; the override pointer is set to NULL and the cgroup's
+ * own swappiness setting takes effect.
+ * - Values above SWAPPINESS_ANON_ONLY are rejected as invalid.
+ * - If @reclaim_options does not include MEMCG_RECLAIM_PROACTIVE,
+ * the @swappiness override is ignored entirely by the core
+ * reclaim path and the system default is used regardless.
+ *
+ * Swap usage during reclaim is gated on @reclaim_options: swap is
+ * considered only when MEMCG_RECLAIM_MAY_SWAP is set. Without this
+ * flag, reclaim is restricted to file-backed pages regardless of the
+ * @swappiness value or the cgroup's swappiness setting.
+ *
+ * Return:
+ * The number of pages actually reclaimed on success, or -%EINVAL
+ * if @swappiness exceeds SWAPPINESS_ANON_ONLY.
+ */
+__bpf_kfunc unsigned long
+bpf_try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
+ unsigned long nr_pages, gfp_t gfp_mask,
+ unsigned int reclaim_options,
+ int swappiness)
+{
+ int *swappiness_ptr;
+
+ if (swappiness > SWAPPINESS_ANON_ONLY)
+ return -EINVAL;
+ else if (swappiness < MIN_SWAPPINESS)
+ swappiness_ptr = NULL;
+ else
+ swappiness_ptr = &swappiness;
+
+ return try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
+ reclaim_options, swappiness_ptr);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
@@ -175,6 +230,8 @@ BTF_ID_FLAGS(func, bpf_mem_cgroup_usage)
BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state)
BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_SLEEPABLE)
+BTF_ID_FLAGS(func, bpf_try_to_free_mem_cgroup_pages, KF_SLEEPABLE)
+
BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
--
2.43.0
* [RFC PATCH bpf-next v7 08/11] selftests/bpf: Add tests for memcg_bpf_ops
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
` (6 preceding siblings ...)
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 07/11] mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc Hui Zhu
@ 2026-05-26 2:24 ` Hui Zhu
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 09/11] selftests/bpf: Add test for memcg_bpf_ops hierarchies Hui Zhu
` (2 subsequent siblings)
10 siblings, 0 replies; 18+ messages in thread
From: Hui Zhu @ 2026-05-26 2:24 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Muchun Song, JP Kobryn, Andrew Morton, Shuah Khan, davem,
Jakub Kicinski, Jesper Dangaard Brouer, Stanislav Fomichev,
KP Singh, Tao Chen, Mykyta Yatsenko, Leon Hwang, Anton Protopopov,
Amery Hung, Tobias Klauser, Eyal Birger, Rong Tao, Hao Luo,
Peter Zijlstra, Miguel Ojeda, Nathan Chancellor, Kees Cook,
Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr, Christian Brauner,
Randy Dunlap, Brian Gerst, Masahiro Yamada, Willem de Bruijn,
Jason Xing, Paul Chaignon, Chen Ridong, Lance Yang, Jiayuan Chen,
linux-kernel, bpf, cgroups, linux-mm, netdev, linux-kselftest
Cc: geliang, baohua, Hui Zhu
From: Hui Zhu <zhuhui@kylinos.cn>
Add a comprehensive selftest suite for the `memcg_bpf_ops`
functionality. These tests validate that BPF programs can correctly
influence memory cgroup throttling behavior by implementing the new
hooks.
The test suite is added in `prog_tests/memcg_ops.c` and covers
several key scenarios:
1. `test_memcg_ops_over_high`:
Verifies that a BPF program can trigger throttling on a low-priority
cgroup by returning a delay from the `memcg_charged` hook when a
high-priority cgroup is under pressure.
2. `test_memcg_ops_below_low_over_high`:
Tests the combination of the `below_low` and `memcg_charged`
hooks, ensuring they work together as expected.
3. `test_memcg_ops_below_min_over_high`:
Validates the interaction between the `below_min` and
`memcg_charged` hooks.
The test framework sets up a cgroup hierarchy with high and low
priority groups, attaches BPF programs, runs memory-intensive
workloads, and asserts that the observed throttling (measured by
workload execution time) matches expectations.
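
The timing check can be sketched in isolation as follows. The one-second
margin matches the comparison in real_test_memcg_ops(); the helper names
elapsed_seconds() and low_was_throttled() are hypothetical and exist only
in this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/time.h>

/* Margin in seconds; 1 matches the comparison in real_test_memcg_ops(). */
#define MIN_SLOWDOWN_SEC 1.0

/* Wall-clock seconds between two gettimeofday() samples. */
static double elapsed_seconds(struct timeval start, struct timeval end)
{
	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_usec - start.tv_usec) / 1000000.0;
}

/* True when the low-priority run took more than the margin longer. */
static bool low_was_throttled(double low_sec, double high_sec)
{
	return low_sec - high_sec > MIN_SLOWDOWN_SEC;
}

/* Usage: 7.5 s against 2.0 s passes; a 0.4 s gap does not. */
static void timing_example(void)
{
	assert(low_was_throttled(7.5, 2.0));
	assert(!low_was_throttled(2.4, 2.0));
}
```

A fixed wall-clock margin keeps the check simple, at the cost of being
sensitive to machine load.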
The BPF program (`progs/memcg_ops.c`) uses a tracepoint on
`memcg:count_memcg_events` (specifically PGFAULT) to detect memory
pressure and trigger the appropriate hooks in response. This test
suite provides essential validation for the new memory control
mechanisms.
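
The pressure detection amounts to a fixed one-second counting window with
a threshold. A minimal userspace sketch of that logic, assuming the same
reset-on-trigger behaviour as the tracepoint handler in
`progs/memcg_ops.c` (struct window and window_account() are hypothetical
names, and timestamps are supplied by the caller rather than read from
bpf_ktime_get_ns()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_NS 1000000000ULL /* one second, as in the BPF program */

struct window {
	uint64_t sum;      /* events counted in the current window */
	uint64_t start_ns; /* timestamp at which the window opened */
};

/*
 * Account @val events at time @now_ns. Returns true when the sum within
 * the current window exceeds @threshold; the window is then reset.
 */
static bool window_account(struct window *w, uint64_t now_ns, uint64_t val,
			   uint64_t threshold)
{
	assert(w != NULL);

	if (now_ns - w->start_ns < WINDOW_NS) {
		w->sum += val;
	} else {
		/* The window expired: start a new one at this event. */
		w->start_ns = now_ns;
		w->sum = val;
	}

	if (w->sum > threshold) {
		w->sum = 0;
		w->start_ns = now_ns;
		return true;
	}
	return false;
}
```

With a threshold of 1, as the tests configure it, the second page fault
seen within one second is enough to report pressure.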
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
MAINTAINERS | 2 +
.../selftests/bpf/prog_tests/memcg_ops.c | 561 ++++++++++++++++++
tools/testing/selftests/bpf/progs/memcg_ops.c | 132 +++++
3 files changed, 695 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
diff --git a/MAINTAINERS b/MAINTAINERS
index dfc621ff629d..1be243e544da 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6567,6 +6567,8 @@ F: mm/memcontrol-v1.h
F: mm/page_counter.c
F: mm/swap_cgroup.c
F: samples/cgroup/*
+F: tools/testing/selftests/bpf/prog_tests/memcg_ops.c
+F: tools/testing/selftests/bpf/progs/memcg_ops.c
F: tools/testing/selftests/cgroup/memcg_protection.m
F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
F: tools/testing/selftests/cgroup/test_kmem.c
diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
new file mode 100644
index 000000000000..19fd4fde2266
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
@@ -0,0 +1,561 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Memory controller eBPF struct ops test
+ */
+
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include "cgroup_helpers.h"
+
+struct local_config {
+ u64 threshold;
+ u64 high_cgroup_id;
+ bool use_below_low;
+ bool use_below_min;
+ unsigned int over_high_ms;
+} local_config;
+
+#include "memcg_ops.skel.h"
+
+#define TRIGGER_THRESHOLD 1
+#define OVER_HIGH_MS 2000
+#define FILE_SIZE (64 * 1024 * 1024ul)
+#define BUFFER_SIZE (4096)
+#define CG_LIMIT (120 * 1024 * 1024ul)
+
+#define CG_DIR "/memcg_ops_test"
+#define CG_HIGH_DIR CG_DIR "/high"
+#define CG_LOW_DIR CG_DIR "/low"
+
+static int
+setup_high_low_cgroups(u64 *high_cgroup_id, int *low_cgroup_fd,
+ int *high_cgroup_fd)
+{
+ int ret;
+ char limit_buf[20];
+
+ ret = setup_cgroup_environment();
+ if (!ASSERT_OK(ret, "setup_cgroup_environment"))
+ goto cleanup;
+
+ ret = create_and_get_cgroup(CG_DIR);
+ if (!ASSERT_GE(ret, 0, "create_and_get_cgroup "CG_DIR))
+ goto cleanup;
+ close(ret);
+ ret = enable_controllers(CG_DIR, "memory");
+ if (!ASSERT_OK(ret, "enable_controllers"))
+ goto cleanup;
+ snprintf(limit_buf, 20, "%lu", CG_LIMIT);
+ ret = write_cgroup_file(CG_DIR, "memory.max", limit_buf);
+ if (!ASSERT_OK(ret, "write_cgroup_file memory.max"))
+ goto cleanup;
+ ret = write_cgroup_file(CG_DIR, "memory.swap.max", "0");
+ if (!ASSERT_OK(ret, "write_cgroup_file memory.swap.max"))
+ goto cleanup;
+
+ ret = create_and_get_cgroup(CG_HIGH_DIR);
+ if (!ASSERT_GE(ret, 0, "create_and_get_cgroup "CG_HIGH_DIR))
+ goto cleanup;
+ if (high_cgroup_fd)
+ *high_cgroup_fd = ret;
+ else
+ close(ret);
+ *high_cgroup_id = get_cgroup_id(CG_HIGH_DIR);
+ if (!ASSERT_GT(*high_cgroup_id, 0, "get_cgroup_id"))
+ goto cleanup;
+
+ ret = create_and_get_cgroup(CG_LOW_DIR);
+ if (!ASSERT_GE(ret, 0, "create_and_get_cgroup "CG_LOW_DIR))
+ goto cleanup;
+ if (low_cgroup_fd)
+ *low_cgroup_fd = ret;
+ else
+ close(ret);
+
+ return 0;
+
+cleanup:
+ cleanup_cgroup_environment();
+ return -1;
+}
+
+int write_file(const char *filename)
+{
+ int ret = -1;
+ size_t written = 0;
+ char *buffer;
+ FILE *fp;
+
+ fp = fopen(filename, "wb");
+ if (!fp)
+ goto out;
+
+ buffer = malloc(BUFFER_SIZE);
+ if (!buffer)
+ goto cleanup_fp;
+
+ memset(buffer, 'A', BUFFER_SIZE);
+
+ while (written < FILE_SIZE) {
+ size_t to_write = (FILE_SIZE - written < BUFFER_SIZE) ?
+ (FILE_SIZE - written) :
+ BUFFER_SIZE;
+
+ if (fwrite(buffer, 1, to_write, fp) != to_write)
+ goto cleanup;
+ written += to_write;
+ }
+
+ ret = 0;
+cleanup:
+ free(buffer);
+cleanup_fp:
+ fclose(fp);
+out:
+ return ret;
+}
+
+int read_file(const char *filename, int iterations)
+{
+ int ret = -1;
+ long page_size = sysconf(_SC_PAGESIZE);
+ char *p;
+ char *map;
+ size_t i;
+ int fd;
+ struct stat sb;
+
+ fd = open(filename, O_RDONLY);
+ if (fd == -1)
+ goto out;
+
+ if (fstat(fd, &sb) == -1)
+ goto cleanup_fd;
+
+ if (sb.st_size != FILE_SIZE) {
+ fprintf(stderr, "File size mismatch: expected %lu, got %lu\n",
+ (unsigned long)FILE_SIZE, (unsigned long)sb.st_size);
+ goto cleanup_fd;
+ }
+
+ map = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+ if (map == MAP_FAILED)
+ goto cleanup_fd;
+
+ for (int iter = 0; iter < iterations; iter++) {
+ for (i = 0; i < FILE_SIZE; i += page_size) {
+ /* read a byte to trigger page fault */
+ p = &map[i];
+ __asm__ __volatile__("" : : "r"(*p) : "memory");
+ }
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d %d done\n", __func__, getpid(), iter);
+ }
+
+ if (munmap(map, FILE_SIZE) == -1)
+ goto cleanup_fd;
+
+ ret = 0;
+
+cleanup_fd:
+ close(fd);
+out:
+ return ret;
+}
+
+static int
+real_test_memcg_ops_child_work(const char *cgroup_path,
+ char *data_filename,
+ char *time_filename,
+ int read_times)
+{
+ struct timeval start, end;
+ double elapsed;
+ FILE *fp;
+ int ret = -1;
+
+ if (!ASSERT_OK(join_parent_cgroup(cgroup_path), "join_parent_cgroup"))
+ goto out;
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d begin\n", __func__, getpid());
+
+ gettimeofday(&start, NULL);
+
+ if (!ASSERT_OK(write_file(data_filename), "write_file"))
+ goto out;
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d write_file done\n", __func__, getpid());
+
+ if (!ASSERT_OK(read_file(data_filename, read_times), "read_file"))
+ goto out;
+
+ gettimeofday(&end, NULL);
+
+ elapsed = (end.tv_sec - start.tv_sec) +
+ (end.tv_usec - start.tv_usec) / 1000000.0;
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d end %.6f\n", __func__, getpid(), elapsed);
+
+ fp = fopen(time_filename, "w");
+ if (!ASSERT_OK_PTR(fp, "fopen"))
+ goto out;
+ fprintf(fp, "%.6f", elapsed);
+ fclose(fp);
+
+ ret = 0;
+out:
+ return ret;
+}
+
+static int get_time(char *time_filename, double *time)
+{
+ int ret = -1;
+ FILE *fp;
+ char buf[64];
+
+ fp = fopen(time_filename, "r");
+ if (!ASSERT_OK_PTR(fp, "fopen"))
+ goto out;
+
+ if (!ASSERT_OK_PTR(fgets(buf, sizeof(buf), fp), "fgets"))
+ goto cleanup;
+
+ if (sscanf(buf, "%lf", time) != 1) {
+ PRINT_FAIL("sscanf %s", buf);
+ goto cleanup;
+ }
+
+ ret = 0;
+cleanup:
+ fclose(fp);
+out:
+ return ret;
+}
+
+static void real_test_memcg_ops(int read_times)
+{
+ int ret;
+ char data_file1[] = "/tmp/test_data_1_XXXXXX";
+ char data_file2[] = "/tmp/test_data_2_XXXXXX";
+ char time_file1[] = "/tmp/test_time_1_XXXXXX";
+ char time_file2[] = "/tmp/test_time_2_XXXXXX";
+ pid_t pid1, pid2;
+ double time1, time2;
+ int status;
+
+ ret = mkstemp(data_file1);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ return;
+ close(ret);
+ ret = mkstemp(data_file2);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ goto cleanup_data_file1;
+ close(ret);
+ ret = mkstemp(time_file1);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ goto cleanup_data_file2;
+ close(ret);
+ ret = mkstemp(time_file2);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ goto cleanup_time_file1;
+ close(ret);
+
+ pid1 = fork();
+ if (!ASSERT_GE(pid1, 0, "fork"))
+ goto cleanup;
+ if (pid1 == 0) {
+ exit(real_test_memcg_ops_child_work(CG_LOW_DIR,
+ data_file1,
+ time_file1,
+ read_times));
+ }
+
+ pid2 = fork();
+ if (!ASSERT_GE(pid2, 0, "fork")) {
+ /* Reap first child to avoid a zombie if second fork fails. */
+ (void)waitpid(pid1, NULL, 0);
+ goto cleanup;
+ }
+ if (pid2 == 0) {
+ exit(real_test_memcg_ops_child_work(CG_HIGH_DIR,
+ data_file2,
+ time_file2,
+ read_times));
+ }
+
+ ret = waitpid(pid1, &status, 0);
+ if (!ASSERT_GT(ret, 0, "child1 waitpid"))
+ goto cleanup;
+ if (!ASSERT_TRUE(WIFEXITED(status), "child1 exited normally"))
+ goto cleanup;
+ if (!ASSERT_EQ(WEXITSTATUS(status), 0, "child1 exit status"))
+ goto cleanup;
+
+ ret = waitpid(pid2, &status, 0);
+ if (!ASSERT_GT(ret, 0, "child2 waitpid"))
+ goto cleanup;
+ if (!ASSERT_TRUE(WIFEXITED(status), "child2 exited normally"))
+ goto cleanup;
+ if (!ASSERT_EQ(WEXITSTATUS(status), 0, "child2 exit status"))
+ goto cleanup;
+
+ if (get_time(time_file1, &time1))
+ goto cleanup;
+
+ if (get_time(time_file2, &time2))
+ goto cleanup;
+
+ if (time1 < time2 || time1 - time2 <= 1)
+ PRINT_FAIL("Low priority cgroup not slower: low=%f vs high=%f",
+ time1, time2);
+
+cleanup:
+ unlink(time_file2);
+cleanup_time_file1:
+ unlink(time_file1);
+cleanup_data_file2:
+ unlink(data_file2);
+cleanup_data_file1:
+ unlink(data_file1);
+}
+
+void test_memcg_ops_over_high(void)
+{
+ int err, map_fd;
+ struct memcg_ops *skel = NULL;
+ struct bpf_map *map;
+ struct memcg_ops__bss *bss_data;
+ __u32 key = 0;
+ struct bpf_program *prog = NULL;
+ struct bpf_link *link = NULL, *link2 = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ u64 high_cgroup_id;
+ int low_cgroup_fd = -1;
+
+ err = setup_high_low_cgroups(&high_cgroup_id, &low_cgroup_fd, NULL);
+ if (!ASSERT_OK(err, "setup_high_low_cgroups"))
+ goto out;
+
+ skel = memcg_ops__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, ".bss");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name .bss"))
+ goto out;
+
+ map_fd = bpf_map__fd(map);
+ bss_data = calloc(1, bpf_map__value_size(map));
+ if (!ASSERT_OK_PTR(bss_data, "calloc(1, bpf_map__value_size(map))"))
+ goto out;
+ bss_data->local_config.high_cgroup_id = high_cgroup_id;
+ bss_data->local_config.threshold = TRIGGER_THRESHOLD;
+ bss_data->local_config.use_below_low = false;
+ bss_data->local_config.use_below_min = false;
+ bss_data->local_config.over_high_ms = OVER_HIGH_MS;
+ err = bpf_map_update_elem(map_fd, &key, bss_data, BPF_EXIST);
+ free(bss_data);
+ if (!ASSERT_OK(err, "bpf_map_update_elem"))
+ goto out;
+
+ prog = bpf_object__find_program_by_name(skel->obj,
+ "handle_count_memcg_events");
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto out;
+
+ link = bpf_program__attach(prog);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "low_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name low_mcg_ops"))
+ goto out;
+
+ opts.flags = BPF_F_CGROUP_FD;
+ opts.target_fd = low_cgroup_fd;
+ link2 = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link2, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ real_test_memcg_ops(5);
+
+out:
+ bpf_link__destroy(link);
+ bpf_link__destroy(link2);
+ if (skel) {
+ memcg_ops__detach(skel);
+ memcg_ops__destroy(skel);
+ }
+ close(low_cgroup_fd);
+ cleanup_cgroup_environment();
+}
+
+void test_memcg_ops_below_low_over_high(void)
+{
+ int err, map_fd;
+ struct memcg_ops *skel = NULL;
+ struct bpf_map *map;
+ struct memcg_ops__bss *bss_data;
+ __u32 key = 0;
+ struct bpf_program *prog = NULL;
+ struct bpf_link *link = NULL, *link_high = NULL, *link_low = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ u64 high_cgroup_id;
+ int high_cgroup_fd = -1, low_cgroup_fd = -1;
+
+ err = setup_high_low_cgroups(&high_cgroup_id, &low_cgroup_fd,
+ &high_cgroup_fd);
+ if (!ASSERT_OK(err, "setup_high_low_cgroups"))
+ goto out;
+
+ skel = memcg_ops__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, ".bss");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name .bss"))
+ goto out;
+
+ map_fd = bpf_map__fd(map);
+ bss_data = calloc(1, bpf_map__value_size(map));
+ if (!ASSERT_OK_PTR(bss_data, "calloc(1, bpf_map__value_size(map))"))
+ goto out;
+ bss_data->local_config.high_cgroup_id = high_cgroup_id;
+ bss_data->local_config.threshold = TRIGGER_THRESHOLD;
+ bss_data->local_config.use_below_low = true;
+ bss_data->local_config.use_below_min = false;
+ bss_data->local_config.over_high_ms = OVER_HIGH_MS;
+ err = bpf_map_update_elem(map_fd, &key, bss_data, BPF_EXIST);
+ free(bss_data);
+ if (!ASSERT_OK(err, "bpf_map_update_elem"))
+ goto out;
+
+ prog = bpf_object__find_program_by_name(skel->obj,
+ "handle_count_memcg_events");
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto out;
+
+ link = bpf_program__attach(prog);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "high_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name high_mcg_ops"))
+ goto out;
+ opts.flags = BPF_F_CGROUP_FD;
+ opts.target_fd = high_cgroup_fd;
+ link_high = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link_high, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "low_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name low_mcg_ops"))
+ goto out;
+ opts.target_fd = low_cgroup_fd;
+ link_low = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link_low, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ real_test_memcg_ops(50);
+
+out:
+ bpf_link__destroy(link);
+ bpf_link__destroy(link_high);
+ bpf_link__destroy(link_low);
+ if (skel) {
+ memcg_ops__detach(skel);
+ memcg_ops__destroy(skel);
+ }
+ close(high_cgroup_fd);
+ close(low_cgroup_fd);
+ cleanup_cgroup_environment();
+}
+
+void test_memcg_ops_below_min_over_high(void)
+{
+ int err, map_fd;
+ struct memcg_ops *skel = NULL;
+ struct bpf_map *map;
+ struct memcg_ops__bss *bss_data;
+ __u32 key = 0;
+ struct bpf_program *prog = NULL;
+ struct bpf_link *link = NULL, *link_high = NULL, *link_low = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ u64 high_cgroup_id;
+ int high_cgroup_fd = -1, low_cgroup_fd = -1;
+
+ err = setup_high_low_cgroups(&high_cgroup_id, &low_cgroup_fd,
+ &high_cgroup_fd);
+ if (!ASSERT_OK(err, "setup_high_low_cgroups"))
+ goto out;
+
+ skel = memcg_ops__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, ".bss");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name .bss"))
+ goto out;
+
+ map_fd = bpf_map__fd(map);
+ bss_data = calloc(1, bpf_map__value_size(map));
+ if (!ASSERT_OK_PTR(bss_data, "calloc(1, bpf_map__value_size(map))"))
+ goto out;
+ bss_data->local_config.high_cgroup_id = high_cgroup_id;
+ bss_data->local_config.threshold = TRIGGER_THRESHOLD;
+ bss_data->local_config.use_below_low = false;
+ bss_data->local_config.use_below_min = true;
+ bss_data->local_config.over_high_ms = OVER_HIGH_MS;
+ err = bpf_map_update_elem(map_fd, &key, bss_data, BPF_EXIST);
+ free(bss_data);
+ if (!ASSERT_OK(err, "bpf_map_update_elem"))
+ goto out;
+
+ prog = bpf_object__find_program_by_name(skel->obj,
+ "handle_count_memcg_events");
+ if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
+ goto out;
+
+ link = bpf_program__attach(prog);
+ if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "high_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name high_mcg_ops"))
+ goto out;
+ opts.flags = BPF_F_CGROUP_FD;
+ opts.target_fd = high_cgroup_fd;
+ link_high = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link_high, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ map = bpf_object__find_map_by_name(skel->obj, "low_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name low_mcg_ops"))
+ goto out;
+ opts.target_fd = low_cgroup_fd;
+ link_low = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link_low, "bpf_map__attach_struct_ops_opts"))
+ goto out;
+
+ real_test_memcg_ops(50);
+
+out:
+ bpf_link__destroy(link);
+ bpf_link__destroy(link_high);
+ bpf_link__destroy(link_low);
+ if (skel) {
+ memcg_ops__detach(skel);
+ memcg_ops__destroy(skel);
+ }
+ close(high_cgroup_fd);
+ close(low_cgroup_fd);
+ cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/memcg_ops.c b/tools/testing/selftests/bpf/progs/memcg_ops.c
new file mode 100644
index 000000000000..4a1d817c1d9c
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/memcg_ops.c
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define ONE_SECOND_NS 1000000000
+
+struct local_config {
+ u64 threshold;
+ u64 high_cgroup_id;
+ bool use_below_low;
+ bool use_below_min;
+ unsigned int over_high_ms;
+} local_config;
+
+struct AggregationData {
+ u64 sum;
+ u64 window_start_ts;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct AggregationData);
+} aggregation_map SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, u64);
+} trigger_ts_map SEC(".maps");
+
+SEC("tp/memcg/count_memcg_events")
+int
+handle_count_memcg_events(struct trace_event_raw_memcg_rstat_events *ctx)
+{
+ u32 key = 0;
+ struct AggregationData *data;
+ u64 current_ts;
+
+ if (ctx->id != local_config.high_cgroup_id ||
+ (ctx->item != PGFAULT))
+ goto out;
+
+ data = bpf_map_lookup_elem(&aggregation_map, &key);
+ if (!data)
+ goto out;
+
+ current_ts = bpf_ktime_get_ns();
+
+ if (current_ts - data->window_start_ts < ONE_SECOND_NS) {
+ data->sum += ctx->val;
+ } else {
+ data->window_start_ts = current_ts;
+ data->sum = ctx->val;
+ }
+
+ if (data->sum > local_config.threshold) {
+ bpf_map_update_elem(&trigger_ts_map, &key, &current_ts,
+ BPF_ANY);
+ data->sum = 0;
+ data->window_start_ts = current_ts;
+ }
+
+out:
+ return 0;
+}
+
+static bool need_threshold(void)
+{
+ u32 key = 0;
+ u64 *trigger_ts;
+ bool ret = false;
+ u64 current_ts;
+
+ trigger_ts = bpf_map_lookup_elem(&trigger_ts_map, &key);
+ if (!trigger_ts || *trigger_ts == 0)
+ goto out;
+
+ current_ts = bpf_ktime_get_ns();
+
+ if (current_ts - *trigger_ts < ONE_SECOND_NS)
+ ret = true;
+
+out:
+ return ret;
+}
+
+SEC("struct_ops/below_low")
+bool below_low_impl(struct mem_cgroup *memcg, unsigned long elow,
+ unsigned long usage)
+{
+ if (!local_config.use_below_low)
+ return false;
+
+ return need_threshold();
+}
+
+SEC("struct_ops/below_min")
+bool below_min_impl(struct mem_cgroup *memcg, unsigned long emin,
+ unsigned long usage)
+{
+ if (!local_config.use_below_min)
+ return false;
+
+ return need_threshold();
+}
+
+SEC("struct_ops/memcg_charged")
+unsigned int memcg_charged_impl(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+ if (local_config.over_high_ms && need_threshold())
+ return local_config.over_high_ms;
+
+ return 0;
+}
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops high_mcg_ops = {
+ .below_low = (void *)below_low_impl,
+ .below_min = (void *)below_min_impl,
+};
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops low_mcg_ops = {
+ .memcg_charged = (void *)memcg_charged_impl,
+};
+
+char LICENSE[] SEC("license") = "GPL";
--
2.43.0
* [RFC PATCH bpf-next v7 09/11] selftests/bpf: Add test for memcg_bpf_ops hierarchies
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
` (7 preceding siblings ...)
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 08/11] selftests/bpf: Add tests for memcg_bpf_ops Hui Zhu
@ 2026-05-26 2:27 ` Hui Zhu
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 10/11] selftests/bpf: Add selftest for memcg async reclaim via BPF Hui Zhu
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 11/11] samples/bpf: Add memcg priority control and async reclaim example Hui Zhu
10 siblings, 0 replies; 18+ messages in thread
From: Hui Zhu @ 2026-05-26 2:27 UTC (permalink / raw)
From: Hui Zhu <zhuhui@kylinos.cn>
Add a new selftest, `test_memcg_ops_hierarchies`, to validate the
behavior of attaching `memcg_bpf_ops` in a nested cgroup hierarchy,
specifically testing the `BPF_F_ALLOW_OVERRIDE` flag.
The test case performs the following steps:
1. Creates a three-level deep cgroup hierarchy: `/cg`, `/cg/cg`, and
`/cg/cg/cg`.
2. Attaches a BPF struct_ops to the top-level cgroup (`/cg`) with the
`BPF_F_ALLOW_OVERRIDE` flag.
3. Successfully attaches a new struct_ops to the middle cgroup
(`/cg/cg`) without the flag, overriding the inherited one.
4. Asserts that attaching another struct_ops to the deepest cgroup
(`/cg/cg/cg`) fails with -EBUSY, because its parent did not specify
`BPF_F_ALLOW_OVERRIDE`.
This test ensures that the attachment logic correctly enforces the
override rules across a cgroup subtree.
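The rule exercised by the steps above can be modeled in plain userspace C. This is only a sketch of the behavior the test asserts, not the kernel's attach path: `struct toy_cgroup` and `toy_attach()` are hypothetical names, and the model assumes that the nearest ancestor holding an attachment decides whether a descendant may attach.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one cgroup node; hypothetical, not a kernel structure. */
struct toy_cgroup {
	struct toy_cgroup *parent;
	bool attached;       /* an ops table is attached directly here */
	bool allow_override; /* it was attached with the override flag */
};

/* Return 0 if attaching to cg is permitted, -EBUSY otherwise. */
static int toy_attach(struct toy_cgroup *cg, bool allow_override)
{
	struct toy_cgroup *p;

	assert(cg != NULL);

	/* Assumption: the nearest ancestor with an attachment decides. */
	for (p = cg->parent; p; p = p->parent) {
		if (!p->attached)
			continue;
		if (!p->allow_override)
			return -EBUSY;
		break;
	}

	cg->attached = true;
	cg->allow_override = allow_override;
	return 0;
}
```

With three nodes chained like `/cg`, `/cg/cg` and `/cg/cg/cg`, attaching to the first with the flag and to the second without it succeeds in this model, and the third attach returns -EBUSY, matching steps 2 to 4.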
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
.../selftests/bpf/prog_tests/memcg_ops.c | 73 +++++++++++++++++++
1 file changed, 73 insertions(+)
diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
index 19fd4fde2266..b4084e9327eb 100644
--- a/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
+++ b/tools/testing/selftests/bpf/prog_tests/memcg_ops.c
@@ -559,3 +559,76 @@ void test_memcg_ops_below_min_over_high(void)
close(low_cgroup_fd);
cleanup_cgroup_environment();
}
+
+void test_memcg_ops_hierarchies(void)
+{
+ int ret, first = -1, second = -1, third = -1;
+ struct memcg_ops *skel = NULL;
+ struct bpf_map *map;
+ struct bpf_link *link1 = NULL, *link2 = NULL, *link3 = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+
+ ret = setup_cgroup_environment();
+ if (!ASSERT_OK(ret, "setup_cgroup_environment"))
+ goto cleanup;
+
+ first = create_and_get_cgroup("/cg");
+ if (!ASSERT_GE(first, 0, "create_and_get_cgroup /cg"))
+ goto cleanup;
+ ret = enable_controllers("/cg", "memory");
+ if (!ASSERT_OK(ret, "enable_controllers"))
+ goto cleanup;
+
+ second = create_and_get_cgroup("/cg/cg");
+ if (!ASSERT_GE(second, 0, "create_and_get_cgroup /cg/cg"))
+ goto cleanup;
+ ret = enable_controllers("/cg/cg", "memory");
+ if (!ASSERT_OK(ret, "enable_controllers"))
+ goto cleanup;
+
+ third = create_and_get_cgroup("/cg/cg/cg");
+ if (!ASSERT_GE(third, 0, "create_and_get_cgroup /cg/cg/cg"))
+ goto cleanup;
+ ret = enable_controllers("/cg/cg/cg", "memory");
+ if (!ASSERT_OK(ret, "enable_controllers"))
+ goto cleanup;
+
+ skel = memcg_ops__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_ops__open_and_load"))
+ goto cleanup;
+
+ map = bpf_object__find_map_by_name(skel->obj, "low_mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name low_mcg_ops"))
+ goto cleanup;
+
+ opts.target_fd = first;
+ opts.flags = BPF_F_ALLOW_OVERRIDE | BPF_F_CGROUP_FD;
+ link1 = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link1, "bpf_map__attach_struct_ops_opts"))
+ goto cleanup;
+
+ opts.target_fd = second;
+ opts.flags = BPF_F_CGROUP_FD;
+ link2 = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link2, "bpf_map__attach_struct_ops_opts"))
+ goto cleanup;
+
+ opts.target_fd = third;
+ opts.flags = BPF_F_CGROUP_FD;
+ link3 = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_ERR_PTR(link3, "bpf_map__attach_struct_ops_opts"))
+ goto cleanup;
+
+cleanup:
+ bpf_link__destroy(link1);
+ bpf_link__destroy(link2);
+ bpf_link__destroy(link3);
+ if (skel) {
+ memcg_ops__detach(skel);
+ memcg_ops__destroy(skel);
+ }
+ close(first);
+ close(second);
+ close(third);
+ cleanup_cgroup_environment();
+}
--
2.43.0
* [RFC PATCH bpf-next v7 10/11] selftests/bpf: Add selftest for memcg async reclaim via BPF
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
` (8 preceding siblings ...)
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 09/11] selftests/bpf: Add test for memcg_bpf_ops hierarchies Hui Zhu
@ 2026-05-26 2:27 ` Hui Zhu
2026-05-26 3:06 ` bot+bpf-ci
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 11/11] samples/bpf: Add memcg priority control and async reclaim example Hui Zhu
10 siblings, 1 reply; 18+ messages in thread
From: Hui Zhu @ 2026-05-26 2:27 UTC (permalink / raw)
From: Hui Zhu <zhuhui@kylinos.cn>
Add a BPF selftest that demonstrates and validates asynchronous memory
reclaim for a memory cgroup using BPF struct_ops and the BPF workqueue
mechanism.
The BPF program (progs/memcg_async_reclaim.c) registers struct_ops
callbacks for memcg_charged and memcg_uncharged to track the memory
charge/uncharge events of a target cgroup. When accumulated memory
usage exceeds a configured threshold, the memcg_charged callback
enqueues an asynchronous workqueue item via bpf_wq_start(). The
workqueue callback then invokes bpf_try_to_free_mem_cgroup_pages() to
reclaim pages from the target memcg without blocking the charging
context.
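The "enqueue at most one work item" behavior described above relies on a compare-and-exchange gate. The following userspace sketch uses C11 atomics in place of the BPF program's `__atomic_*` builtins; `reclaim_running`, `try_claim_reclaim()` and `reclaim_done()` are hypothetical names chosen for illustration, and the actual queueing of work is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* 0 = no reclaim work in flight, 1 = work queued or running. */
static _Atomic uint64_t reclaim_running;

/*
 * Return true only for the caller that wins the 0 -> 1 transition
 * while usage is at or above the limit. That caller would be the one
 * to queue the asynchronous work item.
 */
static bool try_claim_reclaim(uint64_t usage, uint64_t limit)
{
	uint64_t expected = 0;

	assert(limit > 0);
	if (usage < limit)
		return false;

	return atomic_compare_exchange_strong_explicit(&reclaim_running,
						       &expected, 1,
						       memory_order_acq_rel,
						       memory_order_relaxed);
}

/* Called by the worker once it decides not to re-queue itself. */
static void reclaim_done(void)
{
	atomic_store_explicit(&reclaim_running, 0, memory_order_release);
}
```

The point of the gate is that concurrent charge paths never queue the same work item twice; only after the worker releases the flag can a later charge claim it again.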
The test (prog_tests/memcg_async_reclaim.c) verifies the effectiveness
of this mechanism by:
1. Running a memory workload (sequential file write + mmap read)
without the BPF async reclaim program attached, and asserting that
the memcg "max" event counter increases, confirming that the cgroup
memory limit is being hit.
2. Repeating the same workload with the BPF async reclaim program
active, and asserting that the "max" event counter does NOT
increase, confirming that proactive async reclaim successfully
kept memory usage below the hard limit.
A new helper read_cgroup_file() is added to cgroup_helpers.c to
support reading memcg interface files (e.g. memory.events) from within
the test infrastructure.
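Once the file content is in a buffer, extracting one counter is a flat "key value" lookup. The sketch below is one possible way to do that without modifying the buffer; `events_lookup()` is a hypothetical helper, not the function added by this patch, and the sample text in the usage note is made up.

```c
#include <assert.h>
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up "key value" in flat keyed text such as memory.events.
 * Return 0 and store the value on success, -ENOENT if key is absent.
 */
static int events_lookup(const char *text, const char *key, uint64_t *value)
{
	const char *line = text;
	size_t klen;

	assert(text && key && value);
	klen = strlen(key);

	while (*line) {
		const char *eol = strchr(line, '\n');

		/* Require the whole key followed by one space. */
		if (strncmp(line, key, klen) == 0 && line[klen] == ' ' &&
		    sscanf(line + klen + 1, "%" SCNu64, value) == 1)
			return 0;
		if (!eol)
			break;
		line = eol + 1;
	}
	return -ENOENT;
}
```

For example, given the text "low 0\nhigh 0\nmax 3\n", looking up "max" stores 3. Checking for the trailing space keeps a key such as "oom" from matching the "oom_kill" line.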
The new test files are also registered in MAINTAINERS under the Memory
Controller section.
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
MAINTAINERS | 2 +
tools/testing/selftests/bpf/cgroup_helpers.c | 41 +++
tools/testing/selftests/bpf/cgroup_helpers.h | 2 +
.../bpf/prog_tests/memcg_async_reclaim.c | 333 ++++++++++++++++++
.../selftests/bpf/progs/memcg_async_reclaim.c | 203 +++++++++++
5 files changed, 581 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 1be243e544da..b2e64ef8c60c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6567,7 +6567,9 @@ F: mm/memcontrol-v1.h
F: mm/page_counter.c
F: mm/swap_cgroup.c
F: samples/cgroup/*
+F: tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
F: tools/testing/selftests/bpf/prog_tests/memcg_ops.c
+F: tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
F: tools/testing/selftests/bpf/progs/memcg_ops.c
F: tools/testing/selftests/cgroup/memcg_protection.m
F: tools/testing/selftests/cgroup/test_hugetlb_memcg.c
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 45cd0b479fe3..22420d2f5199 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -167,6 +167,47 @@ int write_cgroup_file(const char *relative_path, const char *file,
return __write_cgroup_file(cgroup_path, file, buf);
}
+/**
+ * read_cgroup_file() - Read content from a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @file: The name of the file in cgroupfs to read from
+ * @buf: Buffer to store the read data
+ * @buf_size: Size of the buffer
+ *
+ * Read the entire content of a cgroup file into the provided buffer.
+ * The buffer will be null-terminated on success.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+int read_cgroup_file(const char *relative_path, const char *file,
+ char *buf, size_t buf_size)
+{
+ char cgroup_path[PATH_MAX - 24];
+ char file_path[PATH_MAX + 1];
+ int fd;
+ ssize_t len;
+
+ if (!relative_path || !file || !buf || buf_size == 0)
+ return -EINVAL;
+
+ format_cgroup_path(cgroup_path, relative_path);
+ snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+
+ fd = open(file_path, O_RDONLY);
+ if (fd < 0)
+ return -errno;
+
+ len = read(fd, buf, buf_size - 1);
+ if (len < 0) {
+ close(fd);
+ return -errno;
+ }
+ close(fd);
+
+ buf[len] = '\0';
+ return 0;
+}
+
/**
* write_cgroup_file_parent() - Write to a cgroup file in the parent process
* workdir
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index 3857304be874..d722e8ff8dee 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -13,6 +13,8 @@
int enable_controllers(const char *relative_path, const char *controllers);
int write_cgroup_file(const char *relative_path, const char *file,
const char *buf);
+int read_cgroup_file(const char *relative_path, const char *file,
+ char *buf, size_t buf_size);
int write_cgroup_file_parent(const char *relative_path, const char *file,
const char *buf);
int cgroup_setup_and_join(const char *relative_path);
diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c b/tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
new file mode 100644
index 000000000000..bf25967c911c
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
@@ -0,0 +1,333 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Memory controller eBPF async reclaim test
+ */
+
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <linux/limits.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+
+#include "cgroup_helpers.h"
+
+struct bpf_args_s {
+ u64 cgroup_id;
+ u64 limit_bytes;
+};
+
+#include "memcg_async_reclaim.skel.h"
+
+#define FILE_SIZE (64 * 1024 * 1024ul)
+#define BUFFER_SIZE (4096)
+#define CG_LIMIT (32 * 1024 * 1024ul)
+#define CG_DIR1 "/memcg_async_reclaim1"
+#define CG_DIR2 "/memcg_async_reclaim2"
+#define RECLAIM_TRIGGER_SIZE (12 * 1024 * 1024ul)
+
+static int
+setup_max_cgroup(const char *cg_path, u64 cg_max, u64 *cgroup_id,
+ int *cgroup_fd)
+{
+ int ret;
+ char limit_buf[20];
+
+ *cgroup_fd = create_and_get_cgroup(cg_path);
+ if (!ASSERT_GE(*cgroup_fd, 0, "create_and_get_cgroup"))
+ goto cleanup;
+
+ *cgroup_id = get_cgroup_id(cg_path);
+ if (!ASSERT_GT(*cgroup_id, 0, "get_cgroup_id"))
+ goto cleanup;
+
+ snprintf(limit_buf, sizeof(limit_buf), "%llu", (unsigned long long)cg_max);
+ ret = write_cgroup_file(cg_path, "memory.max", limit_buf);
+ if (!ASSERT_OK(ret, "write_cgroup_file memory.max"))
+ goto cleanup;
+
+ ret = write_cgroup_file(cg_path, "memory.swap.max", "0");
+ if (!ASSERT_OK(ret, "write_cgroup_file memory.swap.max"))
+ goto cleanup;
+
+ return ret;
+
+cleanup:
+ close(*cgroup_fd);
+ cleanup_cgroup_environment();
+ return -1;
+}
+
+static int
+setup_bpf(u64 cg_id, int cg_fd, u64 limit_bytes,
+ struct memcg_async_reclaim **skel_ptr, struct bpf_link **link_ptr)
+{
+ struct memcg_async_reclaim *skel;
+ struct bpf_map *map;
+ struct bpf_link *link = NULL;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ struct bpf_args_s bpf_args = {
+ .limit_bytes = limit_bytes,
+ .cgroup_id = cg_id,
+ };
+ LIBBPF_OPTS(bpf_test_run_opts, run_opts,
+ .ctx_in = &bpf_args,
+ .ctx_size_in = sizeof(bpf_args)
+ );
+ int prog_init_fd;
+
+ skel = memcg_async_reclaim__open_and_load();
+ if (!ASSERT_OK_PTR(skel, "memcg_async_reclaim__open_and_load"))
+ goto error;
+
+ prog_init_fd = bpf_program__fd(skel->progs.prog_init);
+ if (!ASSERT_GE(prog_init_fd, 0, "bpf_program__fd"))
+ goto destroy_skel;
+ if (!ASSERT_OK((bpf_prog_test_run_opts(prog_init_fd, &run_opts) ||
+ run_opts.retval), "bpf_prog_test_run_opts"))
+ goto destroy_skel;
+
+ map = bpf_object__find_map_by_name(skel->obj, "mcg_ops");
+ if (!ASSERT_OK_PTR(map, "bpf_object__find_map_by_name mcg_ops"))
+ goto destroy_skel;
+ opts.flags = BPF_F_CGROUP_FD;
+ opts.target_fd = cg_fd;
+ link = bpf_map__attach_struct_ops_opts(map, &opts);
+ if (!ASSERT_OK_PTR(link, "bpf_map__attach_struct_ops_opts"))
+ goto destroy_skel;
+
+ *link_ptr = link;
+ *skel_ptr = skel;
+ return 0;
+
+destroy_skel:
+ memcg_async_reclaim__destroy(skel);
+error:
+ return -1;
+}
+
+static int write_file(const char *filename)
+{
+ int ret = -1;
+ size_t written = 0;
+ char *buffer;
+ FILE *fp;
+
+ fp = fopen(filename, "wb");
+ if (!fp)
+ goto out;
+
+ buffer = malloc(BUFFER_SIZE);
+ if (!buffer)
+ goto cleanup_fp;
+
+ memset(buffer, 'A', BUFFER_SIZE);
+
+ while (written < FILE_SIZE) {
+ size_t to_write = (FILE_SIZE - written < BUFFER_SIZE) ?
+ (FILE_SIZE - written) :
+ BUFFER_SIZE;
+
+ if (fwrite(buffer, 1, to_write, fp) != to_write)
+ goto cleanup;
+ written += to_write;
+ }
+
+ ret = 0;
+cleanup:
+ free(buffer);
+cleanup_fp:
+ fclose(fp);
+out:
+ return ret;
+}
+
+static int read_file(const char *filename, int iterations)
+{
+ int ret = -1;
+ long page_size = sysconf(_SC_PAGESIZE);
+ char *p;
+ char *map;
+ size_t i;
+ int fd;
+ struct stat sb;
+
+ fd = open(filename, O_RDONLY);
+ if (fd == -1)
+ goto out;
+
+ if (fstat(fd, &sb) == -1)
+ goto cleanup_fd;
+
+ if (sb.st_size != FILE_SIZE) {
+ fprintf(stderr, "File size mismatch: expected %lu, got %lu\n",
+ (unsigned long)FILE_SIZE, (unsigned long)sb.st_size);
+ goto cleanup_fd;
+ }
+
+ map = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
+ if (map == MAP_FAILED)
+ goto cleanup_fd;
+
+ for (int iter = 0; iter < iterations; iter++) {
+ for (i = 0; i < FILE_SIZE; i += page_size) {
+ /* access a byte to trigger page fault */
+ p = &map[i];
+ __asm__ __volatile__("" : : "r"(*(volatile char *)p) : "memory");
+ }
+
+ if (env.verbosity >= VERBOSE_NORMAL)
+ printf("%s %d %d done\n", __func__, getpid(), iter);
+ }
+
+ if (munmap(map, FILE_SIZE) == -1)
+ goto cleanup_fd;
+
+ ret = 0;
+
+cleanup_fd:
+ close(fd);
+out:
+ return ret;
+}
+
+int get_cgroup_memory_event(const char *relative_path, const char *key,
+ u64 *value)
+{
+ char buf[1024];
+ char *line, *saveptr1;
+ char *c, *saveptr2;
+ char *val_str = NULL;
+ bool found = false;
+ int ret, i;
+
+ if (!key || !value)
+ return -EINVAL;
+
+ ret = read_cgroup_file(relative_path, "memory.events",
+ buf, sizeof(buf));
+ if (ret < 0)
+ return ret;
+
+ for (line = strtok_r(buf, "\n", &saveptr1); line;
+ line = strtok_r(NULL, "\n", &saveptr1)) {
+ val_str = NULL;
+ i = 0;
+
+ for (c = strtok_r(line, " ", &saveptr2); c;
+ c = strtok_r(NULL, " ", &saveptr2)) {
+ if (i == 0) {
+ if (strcmp(c, key) != 0)
+ break;
+ } else if (i == 1) {
+ val_str = c;
+ break;
+ }
+ i++;
+ }
+
+ if (val_str) {
+ char *endptr;
+ u64 v;
+
+ v = strtoull(val_str, &endptr, 10);
+ if (endptr == val_str)
+ return -EINVAL;
+
+ *value = v;
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ return -ENOENT;
+
+ return 0;
+}
+
+void test_memcg_async_reclaim(void)
+{
+ u64 cgroup_id, old_max, new_max;
+ int cgroup_fd, ret;
+ struct memcg_async_reclaim *skel;
+ struct bpf_link *link = NULL;
+ char data_file1[] = "/tmp/test_data_1_XXXXXX";
+ char data_file2[] = "/tmp/test_data_2_XXXXXX";
+
+ if (!ASSERT_OK(setup_cgroup_environment(), "setup_cgroup_environment"))
+ return;
+
+ // test without async_reclaim
+ if (!ASSERT_OK(setup_max_cgroup(CG_DIR1, CG_LIMIT, &cgroup_id,
+ &cgroup_fd), "setup_max_cgroup"))
+ goto cleanup_cgroup;
+ if (!ASSERT_OK(join_cgroup(CG_DIR1), "join_cgroup"))
+ goto close_cgroup_fd;
+ ret = mkstemp(data_file1);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ goto close_cgroup_fd;
+ close(ret);
+
+ if (!ASSERT_OK(get_cgroup_memory_event(CG_DIR1, "max", &old_max),
+ "get_cgroup_memory_event"))
+ goto cleanup_data_file1;
+ if (!ASSERT_OK(write_file(data_file1), "write_file"))
+ goto cleanup_data_file1;
+ if (!ASSERT_OK(read_file(data_file1, 2), "read_file"))
+ goto cleanup_data_file1;
+ if (!ASSERT_OK(get_cgroup_memory_event(CG_DIR1, "max", &new_max),
+ "get_cgroup_memory_event"))
+ goto cleanup_data_file1;
+ if (!ASSERT_GT(new_max, old_max, "memcg max event not triggered"))
+ goto cleanup_data_file1;
+
+ // test with async_reclaim
+ close(cgroup_fd);
+ if (!ASSERT_OK(setup_max_cgroup(CG_DIR2, CG_LIMIT, &cgroup_id,
+ &cgroup_fd), "setup_max_cgroup"))
+ goto cleanup_data_file1;
+ if (!ASSERT_OK(join_cgroup(CG_DIR2), "join_cgroup"))
+ goto cleanup_data_file1;
+ ret = mkstemp(data_file2);
+ if (!ASSERT_GE(ret, 0, "mkstemp"))
+ goto cleanup_data_file1;
+ close(ret);
+
+ if (!ASSERT_OK(setup_bpf(cgroup_id, cgroup_fd, RECLAIM_TRIGGER_SIZE,
+ &skel, &link),
+ "setup_bpf"))
+ goto cleanup_data_file2;
+ if (!ASSERT_OK(get_cgroup_memory_event(CG_DIR2, "max", &old_max),
+ "get_cgroup_memory_event"))
+ goto cleanup;
+ if (!ASSERT_OK(write_file(data_file2), "write_file"))
+ goto cleanup;
+ if (!ASSERT_OK(read_file(data_file2, 2), "read_file"))
+ goto cleanup;
+ if (!ASSERT_OK(get_cgroup_memory_event(CG_DIR2, "max", &new_max),
+ "get_cgroup_memory_event"))
+ goto cleanup;
+ if (!ASSERT_EQ(new_max, old_max, "memcg max event triggered"))
+ goto cleanup;
+
+cleanup:
+ bpf_link__destroy(link);
+ memcg_async_reclaim__detach(skel);
+ memcg_async_reclaim__destroy(skel);
+cleanup_data_file2:
+ unlink(data_file2);
+cleanup_data_file1:
+ unlink(data_file1);
+close_cgroup_fd:
+ close(cgroup_fd);
+cleanup_cgroup:
+ cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/memcg_async_reclaim.c b/tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
new file mode 100644
index 000000000000..4e66766eb4a3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
@@ -0,0 +1,203 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf_atomic.h>
+
+#define BIT(nr) (1UL << (nr))
+
+#define ___GFP_IO BIT(___GFP_IO_BIT)
+#define ___GFP_FS BIT(___GFP_FS_BIT)
+#define ___GFP_DIRECT_RECLAIM BIT(___GFP_DIRECT_RECLAIM_BIT)
+#define ___GFP_KSWAPD_RECLAIM BIT(___GFP_KSWAPD_RECLAIM_BIT)
+
+#define __GFP_IO ((gfp_t)___GFP_IO)
+#define __GFP_FS ((gfp_t)___GFP_FS)
+#define __GFP_DIRECT_RECLAIM ((gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
+#define __GFP_KSWAPD_RECLAIM ((gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
+#define __GFP_RECLAIM ((gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
+
+#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+
+#define ONE_MB_PAGE_COUNT 256
+
+struct bpf_args_s {
+ u64 cgroup_id;
+ u64 limit_bytes;
+} bpf_args;
+
+struct wq_elem {
+ struct bpf_wq work;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, struct wq_elem);
+} wq_map SEC(".maps");
+
+static s64 allocated;
+static s64 old_allocated;
+static u64 async_free_run;
+static u64 initialize_status = 1;
+
+struct cgroup_memcg {
+ struct cgroup *cgrp;
+ struct mem_cgroup *memcg;
+};
+
+static int get_cgroup_memcg_from_id(u64 cgroup_id, struct cgroup_memcg *cm)
+{
+ cm->cgrp = bpf_cgroup_from_id(cgroup_id);
+ if (!cm->cgrp)
+ return -1;
+
+ cm->memcg = bpf_get_mem_cgroup(&cm->cgrp->self);
+ if (!cm->memcg) {
+ bpf_cgroup_release(cm->cgrp);
+ return -1;
+ }
+
+ return 0;
+}
+
+static void put_cgroup_memcg(struct cgroup_memcg *cm)
+{
+ bpf_put_mem_cgroup(cm->memcg);
+ bpf_cgroup_release(cm->cgrp);
+}
+
+static int async_free(void *map, int *key, void *value)
+{
+ struct cgroup_memcg cm;
+ bool started_wq = false;
+
+ if (get_cgroup_memcg_from_id(bpf_args.cgroup_id, &cm) != 0)
+ return 0;
+
+ if (bpf_try_to_free_mem_cgroup_pages(cm.memcg, 32, GFP_KERNEL,
+ 0, -1) > 0) {
+ if (bpf_mem_cgroup_usage(cm.memcg) >=
+ bpf_args.limit_bytes - (ONE_MB_PAGE_COUNT * __PAGE_SIZE)) {
+ __u32 key2 = 0;
+ struct wq_elem *elem;
+
+ elem = bpf_map_lookup_elem(&wq_map, &key2);
+ if (elem) {
+ bpf_wq_start(&elem->work, 0);
+ started_wq = true;
+ }
+ }
+ }
+ if (!started_wq)
+ __atomic_exchange_n(&async_free_run, 0, __ATOMIC_RELEASE);
+
+ put_cgroup_memcg(&cm);
+ return 0;
+}
+
+SEC("syscall")
+int prog_init(struct bpf_args_s *ctx)
+{
+ struct wq_elem *elem;
+ __u32 key = 0;
+ int ret;
+ u64 expected = 1;
+
+ if (!__atomic_compare_exchange_n(&initialize_status,
+ &expected, 2,
+ false,
+ __ATOMIC_ACQ_REL,
+ __ATOMIC_RELAXED))
+ return -1;
+
+ elem = bpf_map_lookup_elem(&wq_map, &key);
+ if (!elem)
+ return -1;
+ ret = bpf_wq_init(&elem->work, &wq_map, 0);
+ if (ret)
+ goto out;
+ ret = bpf_wq_set_callback(&elem->work, async_free, 0);
+ if (ret)
+ goto out;
+
+ allocated = 0;
+ async_free_run = 0;
+ bpf_args.cgroup_id = ctx->cgroup_id;
+ bpf_args.limit_bytes = ctx->limit_bytes;
+
+out:
+ return ret;
+}
+
+static u64 get_usage(void)
+{
+ u64 ret = 0;
+ struct cgroup_memcg cm;
+
+ if (get_cgroup_memcg_from_id(bpf_args.cgroup_id, &cm) != 0)
+ return 0;
+
+ ret = bpf_mem_cgroup_usage(cm.memcg);
+
+ put_cgroup_memcg(&cm);
+
+ return ret;
+}
+
+s64 abs_diff(s64 a, s64 b)
+{
+ return a > b ? a - b : b - a;
+}
+
+SEC("struct_ops/memcg_charged")
+unsigned int BPF_PROG(memcg_charged_impl, struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ struct wq_elem *elem;
+ __u32 key = 0;
+ u64 expected = 0;
+ s64 cur_allocated;
+ s64 cur_old_allocated;
+
+ __atomic_add_fetch(&allocated, nr_pages, __ATOMIC_RELAXED);
+ cur_allocated = READ_ONCE(allocated);
+ cur_old_allocated = READ_ONCE(old_allocated);
+ if (abs_diff(cur_allocated, cur_old_allocated) < ONE_MB_PAGE_COUNT)
+ goto out;
+ WRITE_ONCE(old_allocated, cur_allocated);
+
+ if (get_usage() < bpf_args.limit_bytes)
+ goto out;
+
+ if (__atomic_compare_exchange_n(&async_free_run,
+ &expected, 1,
+ false,
+ __ATOMIC_ACQ_REL,
+ __ATOMIC_RELAXED)) {
+ elem = bpf_map_lookup_elem(&wq_map, &key);
+ if (!elem)
+ return 0;
+
+ bpf_wq_start(&elem->work, 0);
+ }
+
+out:
+ return 0;
+}
+
+SEC("struct_ops/memcg_uncharged")
+void BPF_PROG(memcg_uncharged_impl, struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ __atomic_sub_fetch(&allocated, nr_pages, __ATOMIC_RELAXED);
+}
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops mcg_ops = {
+ .memcg_charged = (void *)memcg_charged_impl,
+ .memcg_uncharged = (void *)memcg_uncharged_impl,
+};
+
+char LICENSE[] SEC("license") = "GPL";
--
2.43.0
* [RFC PATCH bpf-next v7 11/11] samples/bpf: Add memcg priority control and async reclaim example
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
` (9 preceding siblings ...)
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 10/11] selftests/bpf: Add selftest for memcg async reclaim via BPF Hui Zhu
@ 2026-05-26 2:27 ` Hui Zhu
10 siblings, 0 replies; 18+ messages in thread
From: Hui Zhu @ 2026-05-26 2:27 UTC (permalink / raw)
From: Hui Zhu <zhuhui@kylinos.cn>
Add a sample program demonstrating two complementary use cases for the
`memcg_bpf_ops` feature: priority-based memory throttling and
workqueue-driven asynchronous page reclaim.
The sample consists of a BPF program and a userspace loader:
1. memcg.bpf.c: A BPF program with the following capabilities:
- Monitors PGFAULT events on a high-priority cgroup via a tracepoint.
When the per-second PGFAULT sum exceeds a configurable threshold,
a trigger timestamp is recorded.
- Priority throttling: uses the `below_low` / `below_min` hooks on
the high-priority cgroup and the `memcg_charged` hook on the
low-priority cgroup to apply a configurable delay (over_high_ms),
protecting the high-priority workload.
- Async reclaim: uses the `memcg_charged` / `memcg_uncharged` hooks
together with a BPF workqueue to trigger background page reclaim
on the low-priority cgroup when its memory usage exceeds a
configurable byte threshold (async_trigger_bytes), without
blocking the charging context.
- Six struct_ops variants are exported to allow userspace to attach
only the hooks needed for the chosen feature combination:
high_mcg_ops, high_mcg_ops_below_low, high_mcg_ops_below_min,
low_mcg_ops (combined), low_mcg_ops_high_delay, low_mcg_ops_async.
- A `prog_init` syscall program initialises the BPF workqueue and
copies the configuration from userspace before struct_ops are
attached.
2. memcg.c: A userspace loader that parses command-line arguments,
resolves cgroup IDs from filesystem inodes, loads the BPF skeleton,
calls prog_init via bpf_prog_test_run_opts(), and selects and
attaches the appropriate struct_ops map for the requested feature
combination. It supports BPF_F_ALLOW_OVERRIDE for stackable policies.
Users can run workloads of different priorities in two cgroups and
observe the low-priority workload being throttled or proactively
reclaimed to protect the high-priority one.
Example usage:
# Priority throttling only:
# ./memcg --low_path /sys/fs/cgroup/low \
# --high_path /sys/fs/cgroup/high \
# --threshold 1000 --over_high_ms 500 --use_below_low
# Async reclaim only:
# ./memcg --low_path /sys/fs/cgroup/low \
# --threshold 1000 --async_trigger_bytes 33554432
# Both features combined:
# ./memcg --low_path /sys/fs/cgroup/low \
# --high_path /sys/fs/cgroup/high \
# --threshold 1000 --over_high_ms 500 \
# --async_trigger_bytes 33554432
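The paths passed via --low_path and --high_path have to be turned into cgroup IDs before the BPF program can compare them. A minimal sketch of the inode-based lookup the description mentions, assuming a cgroup v2 mount on a 64-bit system where the directory inode number serves as the cgroup ID; `path_to_inode_id()` is a hypothetical name, not the loader's function:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>

/*
 * Store the inode number of path in *id.
 * Return 0 on success or a negative errno value on failure.
 * Assumption: for a cgroup v2 directory this number is the cgroup ID.
 */
static int path_to_inode_id(const char *path, uint64_t *id)
{
	struct stat st;

	assert(path && id);
	if (stat(path, &st) != 0)
		return -errno;

	*id = (uint64_t)st.st_ino;
	return 0;
}
```

The selftests obtain the ID through get_cgroup_id() instead, so this sketch only illustrates the simpler approach named in the description.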
Signed-off-by: Barry Song <baohua@kernel.org>
Signed-off-by: Geliang Tang <geliang@kernel.org>
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
---
MAINTAINERS | 2 +
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 8 +-
samples/bpf/memcg.bpf.c | 380 +++++++++++++++++++++++++++++++++++++
samples/bpf/memcg.c | 411 ++++++++++++++++++++++++++++++++++++++++
5 files changed, 801 insertions(+), 1 deletion(-)
create mode 100644 samples/bpf/memcg.bpf.c
create mode 100644 samples/bpf/memcg.c
diff --git a/MAINTAINERS b/MAINTAINERS
index b2e64ef8c60c..a3f737a506b5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6566,6 +6566,8 @@ F: mm/memcontrol-v1.c
F: mm/memcontrol-v1.h
F: mm/page_counter.c
F: mm/swap_cgroup.c
+F: samples/bpf/memcg.bpf.c
+F: samples/bpf/memcg.c
F: samples/cgroup/*
F: tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
F: tools/testing/selftests/bpf/prog_tests/memcg_ops.c
diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0002cd359fb1..0de6569cdefd 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -49,3 +49,4 @@ iperf.*
/vmlinux.h
/bpftool/
/libbpf/
+memcg
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95a4fa1f1e44..b00698bdc53b 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -37,6 +37,7 @@ tprogs-y += xdp_fwd
tprogs-y += task_fd_query
tprogs-y += ibumad
tprogs-y += hbm
+tprogs-y += memcg
# Libbpf dependencies
LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -122,6 +123,7 @@ always-y += task_fd_query_kern.o
always-y += ibumad_kern.o
always-y += hbm_out_kern.o
always-y += hbm_edt_kern.o
+always-y += memcg.bpf.o
COMMON_CFLAGS = $(TPROGS_USER_CFLAGS)
TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -289,6 +291,8 @@ $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
$(obj)/hbm.o: $(src)/hbm.h
$(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
+memcg: $(obj)/memcg.skel.h
+
# Override includes for xdp_sample_user.o because $(srctree)/usr/include in
# TPROGS_CFLAGS causes conflicts
XDP_SAMPLE_CFLAGS += -Wall -O2 \
@@ -347,11 +351,13 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
-c $(filter %.bpf.c,$^) -o $@
-LINKED_SKELS := xdp_router_ipv4.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h memcg.skel.h
clean-files += $(LINKED_SKELS)
xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
+memcg.skel.h-deps := memcg.bpf.o
+
LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
BPF_SRCS_LINKED := $(notdir $(wildcard $(src)/*.bpf.c))
diff --git a/samples/bpf/memcg.bpf.c b/samples/bpf/memcg.bpf.c
new file mode 100644
index 000000000000..0995284794ac
--- /dev/null
+++ b/samples/bpf/memcg.bpf.c
@@ -0,0 +1,380 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#define ONE_SECOND_NS 1000000000
+#define ONE_MB_PAGE_COUNT 256
+
+/* GFP flags needed by bpf_try_to_free_mem_cgroup_pages() */
+#define BIT(nr) (1UL << (nr))
+#define ___GFP_IO BIT(___GFP_IO_BIT)
+#define ___GFP_FS BIT(___GFP_FS_BIT)
+#define ___GFP_DIRECT_RECLAIM BIT(___GFP_DIRECT_RECLAIM_BIT)
+#define ___GFP_KSWAPD_RECLAIM BIT(___GFP_KSWAPD_RECLAIM_BIT)
+#define __GFP_IO ((gfp_t)___GFP_IO)
+#define __GFP_FS ((gfp_t)___GFP_FS)
+#define __GFP_DIRECT_RECLAIM ((gfp_t)___GFP_DIRECT_RECLAIM)
+#define __GFP_KSWAPD_RECLAIM ((gfp_t)___GFP_KSWAPD_RECLAIM)
+#define __GFP_RECLAIM ((gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
+#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+
+#define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
+#define MEMCG_RECLAIM_PROACTIVE (1 << 2)
+
+#define ASYNC_FREE_BATCH 32
+#define ASYNC_FREE_LOOP_MAX 16
+
+#define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
+#define WRITE_ONCE(x, val) ((*(volatile typeof(x) *)&(x)) = (val))
+
+struct local_config {
+ u64 threshold;
+ u64 high_cgroup_id;
+ u64 low_cgroup_id;
+ bool use_below_low;
+ bool use_below_min;
+ unsigned int over_high_ms;
+ u64 async_trigger_bytes;
+} local_config;
+
+struct AggregationData {
+ u64 sum;
+ u64 window_start_ts;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, struct AggregationData);
+} aggregation_map SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, u32);
+ __type(value, u64);
+} trigger_ts_map SEC(".maps");
+
+struct wq_elem {
+ struct bpf_wq work;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __uint(max_entries, 1);
+ __type(key, __u32);
+ __type(value, struct wq_elem);
+} wq_map SEC(".maps");
+
+static s64 allocated;
+static s64 old_allocated;
+/*
+ * async_free_run: 0 = idle, 1 = workqueue item is queued/running.
+ * Acts as a one-shot guard: only one reclaim task is in-flight at
+ * a time. Cleared by async_free() once reclaim is complete and
+ * re-armed by __memcg_charged_impl() on the next trigger.
+ */
+static u64 async_free_run;
+
+/*
+ * wq_initialized: flipped from 0 -> 1 by prog_init() to make init
+ * idempotent if prog_init() is called more than once.
+ */
+static u64 wq_initialized;
+
+struct cgroup_memcg {
+ struct cgroup *cgrp;
+ struct mem_cgroup *memcg;
+};
+
+static int get_cgroup_memcg_from_id(u64 cgroup_id, struct cgroup_memcg *cm)
+{
+ cm->cgrp = bpf_cgroup_from_id(cgroup_id);
+ if (!cm->cgrp)
+ return -1;
+
+ cm->memcg = bpf_get_mem_cgroup(&cm->cgrp->self);
+ if (!cm->memcg) {
+ bpf_cgroup_release(cm->cgrp);
+ return -1;
+ }
+ return 0;
+}
+
+static void put_cgroup_memcg(struct cgroup_memcg *cm)
+{
+ bpf_put_mem_cgroup(cm->memcg);
+ bpf_cgroup_release(cm->cgrp);
+}
+
+static int async_free(void *map, int *key, void *value)
+{
+ struct cgroup_memcg cm;
+ bool started_wq = false;
+ int i;
+
+ if (get_cgroup_memcg_from_id(local_config.low_cgroup_id, &cm) != 0)
+ return 0;
+
+ for (i = 0; i < ASYNC_FREE_LOOP_MAX; i++) {
+ if (bpf_try_to_free_mem_cgroup_pages(cm.memcg, ASYNC_FREE_BATCH,
+ GFP_KERNEL,
+ MEMCG_RECLAIM_MAY_SWAP,
+ -1) <= 0)
+ break;
+
+ if (bpf_mem_cgroup_usage(cm.memcg) <
+ local_config.async_trigger_bytes)
+ break;
+ }
+
+ if (i == ASYNC_FREE_LOOP_MAX) {
+ __u32 k = 0;
+ struct wq_elem *elem = bpf_map_lookup_elem(&wq_map, &k);
+
+ if (elem) {
+ bpf_wq_start(&elem->work, 0);
+ started_wq = true;
+ }
+ }
+
+ put_cgroup_memcg(&cm);
+
+ if (!started_wq)
+ __atomic_exchange_n(&async_free_run, 0, __ATOMIC_RELEASE);
+ return 0;
+}
+
+SEC("syscall")
+int prog_init(struct local_config *ctx)
+{
+ struct wq_elem *elem;
+ __u32 key = 0;
+ u64 expected = 0;
+ int ret = -1;
+
+ /* Guard against double-initialisation */
+ if (!__atomic_compare_exchange_n(&wq_initialized, &expected, 1,
+ false,
+ __ATOMIC_ACQ_REL,
+ __ATOMIC_RELAXED))
+ goto out;
+
+ elem = bpf_map_lookup_elem(&wq_map, &key);
+ if (!elem)
+ goto out;
+ ret = bpf_wq_init(&elem->work, &wq_map, 0);
+ if (ret)
+ goto out;
+ ret = bpf_wq_set_callback(&elem->work, async_free, 0);
+ if (ret)
+ goto out;
+
+ allocated = 0;
+ async_free_run = 0;
+ __builtin_memcpy(&local_config, ctx, sizeof(local_config));
+
+out:
+ return ret;
+}
+
+SEC("tp/memcg/count_memcg_events")
+int handle_count_memcg_events(
+ struct trace_event_raw_memcg_rstat_events *ctx)
+{
+ u32 key = 0;
+ struct AggregationData *data;
+ u64 current_ts;
+
+ if (ctx->id != local_config.high_cgroup_id ||
+ ctx->item != PGFAULT)
+ goto out;
+
+ data = bpf_map_lookup_elem(&aggregation_map, &key);
+ if (!data)
+ goto out;
+
+ current_ts = bpf_ktime_get_ns();
+
+ if (current_ts - data->window_start_ts < ONE_SECOND_NS) {
+ data->sum += ctx->val;
+ } else {
+ data->window_start_ts = current_ts;
+ data->sum = ctx->val;
+ }
+
+ if (data->sum > local_config.threshold) {
+ bpf_map_update_elem(&trigger_ts_map, &key, &current_ts,
+ BPF_ANY);
+ data->sum = 0;
+ data->window_start_ts = current_ts;
+ }
+
+out:
+ return 0;
+}
+
+static bool need_threshold(void)
+{
+ u32 key = 0;
+ u64 *trigger_ts;
+ bool ret = false;
+ u64 current_ts;
+
+ trigger_ts = bpf_map_lookup_elem(&trigger_ts_map, &key);
+ if (!trigger_ts || *trigger_ts == 0)
+ goto out;
+
+ current_ts = bpf_ktime_get_ns();
+ if (current_ts - *trigger_ts < ONE_SECOND_NS)
+ ret = true;
+
+out:
+ return ret;
+}
+
+SEC("struct_ops/below_low")
+bool below_low_impl(struct mem_cgroup *memcg, unsigned long elow,
+ unsigned long usage)
+{
+ return need_threshold();
+}
+
+SEC("struct_ops/below_min")
+bool below_min_impl(struct mem_cgroup *memcg, unsigned long elow,
+ unsigned long usage)
+{
+ return need_threshold();
+}
+
+static u64 get_usage(void)
+{
+ u64 ret = 0;
+ struct cgroup_memcg cm;
+
+ if (get_cgroup_memcg_from_id(local_config.low_cgroup_id, &cm) != 0)
+ return 0;
+
+ ret = bpf_mem_cgroup_usage(cm.memcg);
+
+ put_cgroup_memcg(&cm);
+
+ return ret;
+}
+
+static __always_inline s64 abs_diff(s64 a, s64 b)
+{
+ return a > b ? a - b : b - a;
+}
+
+static __always_inline unsigned int
+__memcg_charged_impl(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+ struct wq_elem *elem;
+ __u32 key = 0;
+ u64 expected = 0;
+ s64 cur_allocated;
+ s64 cur_old_allocated;
+
+ __atomic_add_fetch(&allocated, nr_pages, __ATOMIC_RELAXED);
+ cur_allocated = READ_ONCE(allocated);
+ cur_old_allocated = READ_ONCE(old_allocated);
+ if (abs_diff(cur_allocated, cur_old_allocated) < ONE_MB_PAGE_COUNT)
+ goto out;
+ WRITE_ONCE(old_allocated, cur_allocated);
+
+ if (get_usage() < local_config.async_trigger_bytes)
+ goto out;
+
+ if (__atomic_compare_exchange_n(&async_free_run,
+ &expected, 1,
+ false,
+ __ATOMIC_ACQ_REL,
+ __ATOMIC_RELAXED)) {
+ elem = bpf_map_lookup_elem(&wq_map, &key);
+ if (!elem)
+ return 0;
+
+ bpf_wq_start(&elem->work, 0);
+ }
+
+out:
+ return 0;
+}
+
+SEC("struct_ops/memcg_charged")
+unsigned int BPF_PROG(memcg_charged_impl, struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ return __memcg_charged_impl(memcg, nr_pages);
+}
+
+SEC("struct_ops/memcg_uncharged")
+void BPF_PROG(memcg_uncharged_impl, struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ __atomic_sub_fetch(&allocated, nr_pages, __ATOMIC_RELAXED);
+}
+
+unsigned int
+__get_high_delay_ms_impl(struct mem_cgroup *memcg, unsigned int nr_pages)
+{
+ if (need_threshold())
+ return local_config.over_high_ms;
+
+ return 0;
+}
+
+SEC("struct_ops/memcg_charged")
+unsigned int BPF_PROG(get_high_delay_ms_impl, struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ return __get_high_delay_ms_impl(memcg, nr_pages);
+}
+
+SEC("struct_ops/memcg_charged")
+unsigned int BPF_PROG(low_mcg_impl, struct mem_cgroup *memcg,
+ unsigned int nr_pages)
+{
+ __memcg_charged_impl(memcg, nr_pages);
+
+ return __get_high_delay_ms_impl(memcg, nr_pages);
+}
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops high_mcg_ops = {
+ .below_low = (void *)below_low_impl,
+ .below_min = (void *)below_min_impl,
+};
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops high_mcg_ops_below_low = {
+ .below_low = (void *)below_low_impl,
+};
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops high_mcg_ops_below_min = {
+ .below_min = (void *)below_min_impl,
+};
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops low_mcg_ops = {
+ .memcg_charged = (void *)low_mcg_impl,
+ .memcg_uncharged = (void *)memcg_uncharged_impl,
+};
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops low_mcg_ops_high_delay = {
+ .memcg_charged = (void *)get_high_delay_ms_impl,
+};
+
+SEC(".struct_ops.link")
+struct memcg_bpf_ops low_mcg_ops_async = {
+ .memcg_charged = (void *)memcg_charged_impl,
+ .memcg_uncharged = (void *)memcg_uncharged_impl,
+};
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/samples/bpf/memcg.c b/samples/bpf/memcg.c
new file mode 100644
index 000000000000..0929d868e6d8
--- /dev/null
+++ b/samples/bpf/memcg.c
@@ -0,0 +1,411 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <getopt.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+#ifndef __MEMCG_RSTAT_SIMPLE_BPF_SKEL_H__
+#define u64 uint64_t
+#endif
+
+struct local_config {
+ u64 threshold;
+ u64 high_cgroup_id;
+ u64 low_cgroup_id;
+ bool use_below_low;
+ bool use_below_min;
+ unsigned int over_high_ms;
+ u64 async_trigger_bytes;
+};
+
+#include "memcg.skel.h"
+
+static bool exiting;
+
+static void sig_handler(int sig)
+{
+ exiting = true;
+}
+
+static void usage(const char *name)
+{
+ fprintf(stderr,
+ "Usage: %s --low_path=<path> --high_path=<path>\n"
+ " --threshold=<value> [OPTIONS]\n\n",
+ name);
+
+ fprintf(stderr, "Required arguments:\n");
+ fprintf(stderr,
+ " -l, --low_path=PATH Low priority memcgroup path\n");
+ fprintf(stderr,
+ " -g, --high_path=PATH High priority memcgroup path\n");
+ fprintf(stderr,
+ " -t, --threshold=VALUE Sum of PGFAULT 'val' events from\n"
+ " the high-priority cgroup per second\n"
+ " needed to trigger low-priority\n"
+ " cgroup throttling\n\n");
+
+ fprintf(stderr, "Priority throttling options:\n");
+ fprintf(stderr,
+ " -L, --use_below_low Enable the below_low hook on the\n"
+ " high-priority cgroup\n");
+ fprintf(stderr,
+ " -M, --use_below_min Enable the below_min hook on the\n"
+ " high-priority cgroup\n");
+ fprintf(stderr,
+ " -o, --over_high_ms=VALUE Delay (ms) returned by memcg_charged\n"
+ " for the low-priority cgroup while\n"
+ " throttling is active (default: 0)\n");
+ fprintf(stderr,
+ " -a, --async_trigger_bytes=BYTES\n"
+ " Usage threshold in bytes for the\n"
+ " low-priority memcgroup, above\n"
+ " which background page reclaim\n"
+ " is triggered.\n"
+ " 0 or omitted = feature disabled.\n");
+ fprintf(stderr,
+ " -O, --allow_override Set BPF_F_ALLOW_OVERRIDE when\n"
+ " attaching struct_ops\n\n");
+
+ fprintf(stderr, "Misc:\n");
+ fprintf(stderr, " -h, --help Show this help message\n\n");
+
+ fprintf(stderr, "Examples:\n");
+ fprintf(stderr,
+ " # Priority throttling only:\n"
+ " %s --low_path=/sys/fs/cgroup/low \\\n"
+ " --high_path=/sys/fs/cgroup/high \\\n"
+ " --threshold=1000 --over_high_ms=500 --use_below_low\n\n",
+ name);
+ fprintf(stderr,
+ " # Async reclaim only (no throttling):\n"
+ " %s --low_path=/sys/fs/cgroup/low \\\n"
+ " --threshold=1000 \\\n"
+ " --async_trigger_bytes=33554432\n\n",
+ name);
+ fprintf(stderr,
+ " # Both features combined:\n"
+ " %s --low_path=/sys/fs/cgroup/low \\\n"
+ " --high_path=/sys/fs/cgroup/high \\\n"
+ " --threshold=1000 --over_high_ms=500 \\\n"
+ " --async_trigger_bytes=33554432\n",
+ name);
+}
+
+static uint64_t get_cgroup_id(const char *cgroup_path)
+{
+ struct stat st;
+
+ if (!cgroup_path) {
+ fprintf(stderr, "Error: cgroup_path is NULL\n");
+ return 0;
+ }
+
+ if (stat(cgroup_path, &st) < 0) {
+ fprintf(stderr, "Error: stat(%s) failed: %d\n",
+ cgroup_path, errno);
+ return 0;
+ }
+
+ return (uint64_t)st.st_ino;
+}
+
+static uint64_t parse_u64(const char *str, const char *prog)
+{
+ uint64_t value;
+
+ errno = 0;
+ value = strtoull(str, NULL, 10);
+ if (errno != 0) {
+ fprintf(stderr, "ERROR: strtoull '%s' failed: %d\n",
+ str, errno);
+ usage(prog);
+ exit(-errno);
+ }
+ return value;
+}
+
+static int
+attach_ops(struct bpf_object *obj, __u32 opts_flags, const char *name, int fd,
+ struct bpf_link **link_ptr)
+{
+ int err;
+ struct bpf_map *map;
+ struct bpf_link *link;
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts,
+ .flags = opts_flags | BPF_F_CGROUP_FD,
+ .target_fd = fd,
+ );
+
+ map = bpf_object__find_map_by_name(obj, name);
+ if (!map) {
+ fprintf(stderr,
+ "ERROR: Failed to find %s map\n", name);
+ err = -ESRCH;
+ goto out;
+ }
+ link = bpf_map__attach_struct_ops_opts(map, &opts);
+ err = libbpf_get_error(link);
+ if (err) {
+ link = NULL;
+ fprintf(stderr,
+ "Failed to attach struct ops %s: %d\n",
+ name, err);
+ goto out;
+ }
+ *link_ptr = link;
+
+out:
+ return err;
+}
+
+int main(int argc, char **argv)
+{
+ int low_cgroup_fd = -1, high_cgroup_fd = -1;
+ struct local_config local_config = {
+ .threshold = 1,
+ .high_cgroup_id = 0,
+ .low_cgroup_id = 0,
+ .use_below_low = false,
+ .use_below_min = false,
+ .over_high_ms = 0,
+ .async_trigger_bytes = 0,
+ };
+ LIBBPF_OPTS(bpf_test_run_opts, run_opts,
+ .ctx_in = &local_config,
+ .ctx_size_in = sizeof(local_config)
+ );
+ int prog_init_fd;
+ __u32 opts_flags = 0;
+ const char *low_path = NULL;
+ const char *high_path = NULL;
+ struct memcg *skel = NULL;
+ struct bpf_program *prog = NULL;
+ struct bpf_link *link = NULL, *link_low = NULL, *link_high = NULL;
+ int err = -EINVAL;
+ int opt;
+ int option_index = 0;
+
+ static struct option long_options[] = {
+ /* required */
+ {"low_path", required_argument, 0, 'l'},
+ {"high_path", required_argument, 0, 'g'},
+ {"threshold", required_argument, 0, 't'},
+ /* priority throttling */
+ {"over_high_ms", required_argument, 0, 'o'},
+ {"use_below_low", no_argument, 0, 'L'},
+ {"use_below_min", no_argument, 0, 'M'},
+ {"async_trigger_bytes", required_argument, 0, 'a'},
+ {"allow_override", no_argument, 0, 'O'},
+ /* misc */
+ {"help", no_argument, 0, 'h'},
+ {0, 0, 0, 0 }
+ };
+
+ while ((opt = getopt_long(argc, argv, "l:g:t:o:LMOa:h",
+ long_options, &option_index)) != -1) {
+ switch (opt) {
+ case 'l':
+ low_path = optarg;
+ break;
+ case 'g':
+ high_path = optarg;
+ break;
+ case 't':
+ local_config.threshold = parse_u64(optarg, argv[0]);
+ break;
+ case 'o':
+ local_config.over_high_ms
+ = (unsigned int)parse_u64(optarg, argv[0]);
+ break;
+ case 'L':
+ local_config.use_below_low = true;
+ break;
+ case 'M':
+ local_config.use_below_min = true;
+ break;
+ case 'O':
+ opts_flags = BPF_F_ALLOW_OVERRIDE;
+ break;
+ case 'a':
+ local_config.async_trigger_bytes
+ = parse_u64(optarg, argv[0]);
+ break;
+ case 'h':
+ usage(argv[0]);
+ return 0;
+ default:
+ usage(argv[0]);
+ return -EINVAL;
+ }
+ }
+
+ if ((!local_config.use_below_low &&
+ !local_config.use_below_min &&
+ !local_config.async_trigger_bytes &&
+ !local_config.over_high_ms) ||
+ ((local_config.use_below_low || local_config.use_below_min) &&
+ !high_path) ||
+ (local_config.async_trigger_bytes && !low_path) ||
+ (local_config.over_high_ms && (!high_path || !low_path))) {
+ fprintf(stderr, "ERROR: Missing required arguments\n\n");
+ usage(argv[0]);
+ goto out;
+ }
+
+
+ if (low_path) {
+ low_cgroup_fd = open(low_path, O_RDONLY);
+ if (low_cgroup_fd < 0) {
+ fprintf(stderr,
+ "ERROR: open low cgroup '%s' failed: %d\n",
+ low_path, errno);
+ err = -errno;
+ goto out;
+ }
+
+ local_config.low_cgroup_id = get_cgroup_id(low_path);
+ if (!local_config.low_cgroup_id) {
+ fprintf(stderr,
+ "ERROR: get low cgroup '%s' id failed: %d\n",
+ low_path, errno);
+ err = -errno;
+ goto out;
+ }
+ }
+
+ if (high_path) {
+ high_cgroup_fd = open(high_path, O_RDONLY);
+ if (high_cgroup_fd < 0) {
+ fprintf(stderr,
+ "ERROR: open high cgroup '%s' failed: %d\n",
+ high_path, errno);
+ err = -errno;
+ goto out;
+ }
+
+ local_config.high_cgroup_id = get_cgroup_id(high_path);
+ if (!local_config.high_cgroup_id) {
+ fprintf(stderr,
+ "ERROR: get high cgroup '%s' id failed: %d\n",
+ high_path, errno);
+ err = -errno;
+ goto out;
+ }
+ }
+
+ skel = memcg__open_and_load();
+ if (!skel) {
+ err = -errno;
+ fprintf(stderr,
+ "ERROR: opening and loading BPF skeleton failed: %d\n",
+ err);
+ goto out;
+ }
+
+ prog_init_fd = bpf_program__fd(skel->progs.prog_init);
+ err = bpf_prog_test_run_opts(prog_init_fd, &run_opts);
+ if (err || run_opts.retval) {
+ fprintf(stderr,
+ "ERROR: prog_init failed (err=%d retval=%d)\n",
+ err, run_opts.retval);
+ err = err ? err : -run_opts.retval;
+ goto out;
+ }
+
+ if (local_config.use_below_low && local_config.use_below_min) {
+ err = attach_ops(skel->obj, opts_flags, "high_mcg_ops",
+ high_cgroup_fd, &link_high);
+ if (err)
+ goto out;
+ } else if (local_config.use_below_low) {
+ err = attach_ops(skel->obj, opts_flags,
+ "high_mcg_ops_below_low",
+ high_cgroup_fd, &link_high);
+ if (err)
+ goto out;
+ } else if (local_config.use_below_min) {
+ err = attach_ops(skel->obj, opts_flags,
+ "high_mcg_ops_below_min",
+ high_cgroup_fd, &link_high);
+ if (err)
+ goto out;
+ }
+
+ if (local_config.over_high_ms && local_config.async_trigger_bytes) {
+ err = attach_ops(skel->obj, opts_flags,
+ "low_mcg_ops",
+ low_cgroup_fd, &link_low);
+ if (err)
+ goto out;
+ } else if (local_config.over_high_ms) {
+ err = attach_ops(skel->obj, opts_flags,
+ "low_mcg_ops_high_delay",
+ low_cgroup_fd, &link_low);
+ if (err)
+ goto out;
+ } else if (local_config.async_trigger_bytes) {
+ err = attach_ops(skel->obj, opts_flags,
+ "low_mcg_ops_async",
+ low_cgroup_fd, &link_low);
+ if (err)
+ goto out;
+ }
+
+ if (local_config.use_below_low || local_config.use_below_min ||
+ local_config.over_high_ms) {
+ prog = bpf_object__find_program_by_name(skel->obj,
+ "handle_count_memcg_events");
+ if (!prog) {
+ fprintf(stderr, "ERROR: finding a prog in BPF object file failed\n");
+ err = -ENOENT;
+ goto out;
+ }
+
+ link = bpf_program__attach(prog);
+ err = libbpf_get_error(link);
+ if (err) {
+ link = NULL;
+ fprintf(stderr,
+ "ERROR: bpf_program__attach failed: %d\n",
+ err);
+ goto out;
+ }
+ }
+
+ printf("Successfully attached!\n");
+
+ signal(SIGINT, sig_handler);
+ signal(SIGTERM, sig_handler);
+
+ while (!exiting)
+ pause();
+
+ printf("Exiting...\n");
+ err = 0;
+
+out:
+ bpf_link__destroy(link);
+ bpf_link__destroy(link_low);
+ bpf_link__destroy(link_high);
+ if (skel) {
+ memcg__detach(skel);
+ memcg__destroy(skel);
+ }
+ if (low_cgroup_fd >= 0)
+ close(low_cgroup_fd);
+ if (high_cgroup_fd >= 0)
+ close(high_cgroup_fd);
+ return err;
+}
--
2.43.0
* Re: [RFC PATCH bpf-next v7 03/11] libbpf: fix return value on memory allocation failure
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 03/11] libbpf: fix return value on memory allocation failure Hui Zhu
@ 2026-05-26 3:06 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-05-26 3:06 UTC (permalink / raw)
To: hui.zhu, ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, inwardvessel, akpm,
shuah, davem, kuba, hawk, sdf, kpsingh, chen.dylane, yatsenko,
leon.hwang, a.s.protopopov, ameryhung, tklauser, eyal.birger,
rongtao, haoluo, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny,
kernel, brauner, rdunlap, brgerst, masahiroy, willemb,
kerneljasonxing, paul.chaignon, chenridong, lance.yang, jiayu
Cc: geliang, baohua, laoar.shao, ast, andrii, daniel, martin.lau,
eddyz87, yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 1519 bytes --]
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ab2071fdd3e8..1e8688975d16 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -13701,7 +13701,7 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>
> link = calloc(1, sizeof(*link));
> if (!link)
> - return libbpf_err_ptr(-EINVAL);
> + return libbpf_err_ptr(-ENOMEM);
>
> /* kern_vdata should be prepared during the loading phase. */
> err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
This is a pre-existing issue, but can this dereference a NULL map->st_ops
pointer?
If a user manually changes a map's type to BPF_MAP_TYPE_STRUCT_OPS using
bpf_map__set_type(), or if a malformed BPF object file defines a map in the
.maps section with this type, map->st_ops is never allocated.
The bpf_object_init_struct_ops() function only allocates st_ops for maps
found in the .struct_ops and .struct_ops.link sections.
Because bpf_map__is_struct_ops() only checks map->def.type, could a missing
NULL check on map->st_ops here and in bpf_map__init_kern_struct_ops() lead to
a crash when privileged daemons parse untrusted BPF ELF files?
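The concern above can be illustrated with a small stand-alone model. This is
only a sketch under stated assumptions: the demo_* types and function are
hypothetical stand-ins, not libbpf's real structures or API, and the error
code chosen is arbitrary. It shows a type check followed by an explicit NULL
check before the pointer is dereferenced.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the map type and struct_ops state. */
enum demo_map_type { DEMO_MAP_TYPE_ARRAY, DEMO_MAP_TYPE_STRUCT_OPS };

struct demo_st_ops {
	void *kern_vdata;
};

struct demo_map {
	enum demo_map_type type;
	/* Only allocated for maps that came from a struct_ops section. */
	struct demo_st_ops *st_ops;
};

static int demo_attach_struct_ops(const struct demo_map *map)
{
	if (!map || map->type != DEMO_MAP_TYPE_STRUCT_OPS)
		return -EINVAL;

	/* The type alone does not prove that st_ops was allocated. */
	if (!map->st_ops)
		return -EINVAL;

	/* From here on, dereferencing st_ops is safe. */
	return map->st_ops->kern_vdata ? 0 : -EINVAL;
}
```

With this shape, a map whose type was changed by hand but whose st_ops was
never set up is rejected with an error instead of crashing.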
Reviewed-by lore thread: https://lore.kernel.org/bpf/20260526024145.5309F1F000E9@smtp.kernel.org/
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26429228214
* Re: [RFC PATCH bpf-next v7 04/11] libbpf: introduce bpf_map__attach_struct_ops_opts()
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 04/11] libbpf: introduce bpf_map__attach_struct_ops_opts() Hui Zhu
@ 2026-05-26 3:06 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-05-26 3:06 UTC (permalink / raw)
To: hui.zhu, ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, inwardvessel, akpm,
shuah, davem, kuba, hawk, sdf, kpsingh, chen.dylane, yatsenko,
leon.hwang, a.s.protopopov, ameryhung, tklauser, eyal.birger,
rongtao, haoluo, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny,
kernel, brauner, rdunlap, brgerst, masahiroy, willemb,
kerneljasonxing, paul.chaignon, chenridong, lance.yang, jiayu
Cc: geliang, baohua, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 3135 bytes --]
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 1e8688975d16..a1b54da1ded2 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
[ ... ]
> @@ -13683,11 +13683,18 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
> return close(link->fd);
> }
>
> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> + const struct bpf_struct_ops_opts *opts)
> {
> + DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
> struct bpf_link_struct_ops *link;
> + int err, fd, target_fd;
> __u32 zero = 0;
> - int err, fd;
> +
> + if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
> + pr_warn("map '%s': invalid opts\n", map->name);
> + return libbpf_err_ptr(-EINVAL);
> + }
>
> if (!bpf_map__is_struct_ops(map)) {
> pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
[ ... ]
> @@ -13724,7 +13731,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> return &link->link;
> }
>
> - fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> + link_opts.flags = OPTS_GET(opts, flags, 0);
> + target_fd = OPTS_GET(opts, target_fd, 0);
> + fd = bpf_link_create(map->fd, target_fd, BPF_STRUCT_OPS, &link_opts);
The expected_revision field declared in struct bpf_struct_ops_opts is never
read or used. The code reads only flags and target_fd from opts, so a caller
who sets expected_revision expecting it to be honored will have the value
silently ignored.
Looking at similar opts structures (bpf_cgroup_opts, bpf_tcx_opts,
bpf_netkit_opts), they all pass expected_revision to bpf_link_create_opts.
Should expected_revision be propagated here as well?
This was raised by bot+bpf-ci@kernel.org in v3:
https://lore.kernel.org/bpf/3af5210f45be4a508e584263a5517781ca762d28d4f6b831b987795662715e61@mail.kernel.org/
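A stand-alone sketch of what propagating every caller-visible field could
look like. Everything named demo_* here is hypothetical: the destination
struct is not the real bpf_link_create_opts layout, and whether the kernel
side accepts a revision for struct_ops links is an open question this sketch
does not answer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the caller-facing opts struct quoted below. */
struct demo_struct_ops_opts {
	size_t sz;
	uint32_t flags;
	uint32_t target_fd;
	uint64_t expected_revision;
};

/* Hypothetical destination; NOT the real bpf_link_create_opts layout. */
struct demo_link_opts {
	uint32_t flags;
	uint64_t expected_revision;
};

/* Copy each field the caller can set, so that none is silently dropped. */
static void demo_fill_link_opts(const struct demo_struct_ops_opts *opts,
				struct demo_link_opts *link_opts,
				uint32_t *target_fd)
{
	link_opts->flags = opts ? opts->flags : 0;
	link_opts->expected_revision = opts ? opts->expected_revision : 0;
	*target_fd = opts ? opts->target_fd : 0;
}
```

If the field cannot be honored, the alternative design is to drop it from the
opts struct so that callers are not misled.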
> if (fd < 0) {
> free(link);
> return libbpf_err_ptr(fd);
[ ... ]
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index bba4e8464396..18af178547ad 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -945,6 +945,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
> struct bpf_map;
>
> LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +
> +struct bpf_struct_ops_opts {
> + /* size of this struct, for forward/backward compatibility */
> + size_t sz;
> + __u32 flags;
> + __u32 target_fd;
> + __u64 expected_revision;
> + size_t :0;
> +};
> +#define bpf_struct_ops_opts__last_field expected_revision
> +
> +LIBBPF_API struct bpf_link *
> +bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> + const struct bpf_struct_ops_opts *opts);
> LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
>
> struct bpf_iter_attach_opts {
[ ... ]
* Re: [RFC PATCH bpf-next v7 07/11] mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 07/11] mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc Hui Zhu
@ 2026-05-26 3:06 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-05-26 3:06 UTC (permalink / raw)
To: hui.zhu, ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, inwardvessel, akpm,
shuah, davem, kuba, hawk, sdf, kpsingh, chen.dylane, yatsenko,
leon.hwang, a.s.protopopov, ameryhung, tklauser, eyal.birger,
rongtao, haoluo, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny,
kernel, brauner, rdunlap, brgerst, masahiroy, willemb,
kerneljasonxing, paul.chaignon, chenridong, lance.yang, jiayu
Cc: geliang, baohua, zhuhui, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 4748 bytes --]
> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> index 1f726a7b22e3..0353c8736aa5 100644
> --- a/mm/bpf_memcontrol.c
> +++ b/mm/bpf_memcontrol.c
[ ... ]
> @@ -162,6 +163,60 @@ __bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> mem_cgroup_flush_stats(memcg);
> }
>
> +/**
> + * bpf_try_to_free_mem_cgroup_pages - attempt to reclaim pages from
> + * a memory cgroup
> + * @memcg: the target memory cgroup to reclaim from
> + * @nr_pages: the number of pages to reclaim
> + * @gfp_mask: GFP flags controlling the reclaim behavior
> + * @reclaim_options: bitmask of MEMCG_RECLAIM_* flags to tune
> + * reclaim strategy
> + * @swappiness: swappiness override value, or a sentinel to use
> + * the default
> + *
> + * BPF-facing wrapper around try_to_free_mem_cgroup_pages() that
> + * validates and translates the @swappiness argument before
> + * delegating to the core reclaim path.
> + *
> + * The @swappiness parameter follows these semantics:
> + * - Values in [MIN_SWAPPINESS, SWAPPINESS_ANON_ONLY] are passed
> + * through as an explicit swappiness override.
> + * - Values below MIN_SWAPPINESS are treated as "use the system
> + * default"; the override pointer is set to NULL and the cgroup's
> + * own swappiness setting takes effect.
> + * - Values above SWAPPINESS_ANON_ONLY are rejected as invalid.
> + * - If @reclaim_options does not include MEMCG_RECLAIM_PROACTIVE,
> + * the @swappiness override is ignored entirely by the core
> + * reclaim path and the system default is used regardless.
> + *
> + * Swap usage during reclaim is gated on @reclaim_options: swap is
> + * considered only when MEMCG_RECLAIM_MAY_SWAP is set. Without this
> + * flag, reclaim is restricted to file-backed pages regardless of the
> + * @swappiness value or the cgroup's swappiness setting.
> + *
> + * Return:
> + * The number of pages actually reclaimed on success, or -%EINVAL
> + * if @swappiness exceeds SWAPPINESS_ANON_ONLY.
> + */
> +unsigned long bpf_try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
Should this function have the __bpf_kfunc annotation? Every other kfunc
in this file (bpf_get_root_mem_cgroup, bpf_get_mem_cgroup,
bpf_put_mem_cgroup, bpf_mem_cgroup_vm_events, bpf_mem_cgroup_usage,
bpf_mem_cgroup_memory_events, bpf_mem_cgroup_page_state,
bpf_mem_cgroup_flush_stats) has it.
Documentation/bpf/kfuncs.rst section 2.5 states that kfunc definitions
should always be annotated with the __bpf_kfunc macro to prevent the
compiler from inlining the kfunc or the function being elided in an LTO
build as it's not used elsewhere in the kernel.
> + unsigned long nr_pages,
> + gfp_t gfp_mask,
> + unsigned int reclaim_options,
> + int swappiness)
> +{
> + int *swapiness_ptr;
> +
> + if (swappiness > SWAPPINESS_ANON_ONLY)
> + return -EINVAL;
How can BPF callers detect this error? The function returns
'unsigned long' (matching try_to_free_mem_cgroup_pages()'s page count),
but here it returns -EINVAL.
Implicit conversion of -EINVAL (int -22) to unsigned long produces
0xFFFFFFFFFFFFFFEA -- a very large positive value, not a negative error
code. The kernel-doc above explicitly states "or -%EINVAL if
@swappiness exceeds SWAPPINESS_ANON_ONLY", which is impossible for an
unsigned return.
A BPF caller cannot distinguish the error from a real reclaim count via
'if (ret < 0)' because ret is unsigned. The sample BPF program later
in the series (samples/bpf/memcg.bpf.c in commit 8dcfd8aa3316) checks
'bpf_try_to_free_mem_cgroup_pages(...) <= 0' which would only match 0
and would silently treat the giant 'error' value as 'many pages
reclaimed'.
Should this either return 'long' instead (so callers can branch on
negative), or use a clearly defined sentinel like (unsigned long)-1
(as the other kfuncs in this file do for errors)?
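The conversion described above can be reproduced in a small userspace sketch. The names reclaim_unsigned(), reclaim_signed() and demo_unsigned_error() are hypothetical stand-ins, not functions from the patch, and SWAPPINESS_LIMIT is an assumed placeholder for SWAPPINESS_ANON_ONLY. The signed variant is only one of the two options raised in the question.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Assumed placeholder for SWAPPINESS_ANON_ONLY; the real value lives in the kernel. */
#define SWAPPINESS_LIMIT 201

/* Mirrors the shape of the patch: an unsigned return type carrying -EINVAL. */
static unsigned long reclaim_unsigned(int swappiness)
{
	if (swappiness > SWAPPINESS_LIMIT)
		return -EINVAL;	/* implicitly converted to ULONG_MAX - EINVAL + 1 */
	return 0;		/* pretend nothing was reclaimed */
}

/* One possible alternative: a signed return type so callers can test for < 0. */
static long reclaim_signed(int swappiness)
{
	if (swappiness > SWAPPINESS_LIMIT)
		return -EINVAL;
	return 0;
}

/* Shows that a "<= 0" check does not catch the error on the unsigned variant. */
static void demo_unsigned_error(void)
{
	unsigned long ret = reclaim_unsigned(SWAPPINESS_LIMIT + 1);

	assert(ret == ULONG_MAX - EINVAL + 1);	/* a very large positive value */
	assert(!(ret <= 0));			/* the "<= 0" check does not fire */
	assert(reclaim_signed(SWAPPINESS_LIMIT + 1) < 0);
}
```

With the signed variant, a caller can branch on a negative return and still treat 0 as "nothing reclaimed".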
> + else if (swappiness < MIN_SWAPPINESS)
> + swapiness_ptr = NULL;
> + else
> + swapiness_ptr = &swappiness;
> +
> + return try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask,
> + reclaim_options, swapiness_ptr);
> +}
[ ... ]
> @@ -175,6 +230,8 @@ BTF_ID_FLAGS(func, bpf_mem_cgroup_usage)
> BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state)
> BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_SLEEPABLE)
>
> +BTF_ID_FLAGS(func, bpf_try_to_free_mem_cgroup_pages, KF_SLEEPABLE)
> +
> BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26429228214
* Re: [RFC PATCH bpf-next v7 10/11] selftests/bpf: Add selftest for memcg async reclaim via BPF
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 10/11] selftests/bpf: Add selftest for memcg async reclaim via BPF Hui Zhu
@ 2026-05-26 3:06 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-05-26 3:06 UTC (permalink / raw)
To: hui.zhu, ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, inwardvessel, akpm,
shuah, davem, kuba, hawk, sdf, kpsingh, chen.dylane, yatsenko,
leon.hwang, a.s.protopopov, ameryhung, tklauser, eyal.birger,
rongtao, haoluo, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny,
kernel, brauner, rdunlap, brgerst, masahiroy, willemb,
kerneljasonxing, paul.chaignon, chenridong, lance.yang, jiayu
Cc: geliang, baohua, zhuhui, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 1826 bytes --]
> diff --git a/MAINTAINERS b/MAINTAINERS
> index c9b3dfe2498b..29de8e2fe55d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6576,7 +6576,9 @@ F: mm/memcontrol-v1.h
[ ... ]
> diff --git a/tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c b/tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
> new file mode 100644
> index 000000000000..bf25967c911c
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
> @@ -0,0 +1,333 @@
[ ... ]
> +void test_memcg_async_reclaim(void)
> +{
> + u64 cgroup_id, old_max, new_max;
> + int cgroup_fd, ret;
> + struct memcg_async_reclaim *skel;
> + struct bpf_link *link = NULL;
> + char data_file1[] = "/tmp/test_data_1_XXXXXX";
> + char data_file2[] = "/tmp/test_data_2_XXXXXX";
Can skel be dereferenced if it isn't initialized? The skel variable is
declared without initialization to NULL, while link is explicitly
initialized to NULL. If control reaches the cleanup block without
memcg_async_reclaim__open_and_load() succeeding, skel remains
uninitialized:
> +cleanup:
> + bpf_link__destroy(link);
> + memcg_async_reclaim__detach(skel);
> + memcg_async_reclaim__destroy(skel);
The generated detach and destroy helpers will dereference skel through
obj->skeleton. Other tests in prog_tests/ check if (skel) before calling
detach/destroy to handle this case safely.
This was raised by XIAO WU in v6 review
(https://lore.kernel.org/all/20260423171516.00004adc@gmail.com/) in the
context of the memcg_ops hierarchies test, but the same pattern appears
in the new async_reclaim test added in v7.
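A minimal userspace sketch of the guarded cleanup pattern, assuming a hypothetical skeleton type. struct demo_skel, demo_skel__open_and_load(), demo_skel__destroy() and run_test() are illustrative stand-ins and not the generated memcg_async_reclaim helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a generated skeleton type. */
struct demo_skel {
	int loaded;
};

static int destroy_calls;	/* counts how often destroy actually ran */

static struct demo_skel *demo_skel__open_and_load(int should_fail)
{
	struct demo_skel *skel;

	if (should_fail)
		return NULL;
	skel = calloc(1, sizeof(*skel));
	if (skel)
		skel->loaded = 1;
	return skel;
}

static void demo_skel__destroy(struct demo_skel *skel)
{
	destroy_calls++;
	free(skel);
}

/* Returns 0 on success, -1 if a failure jumped to cleanup. */
static int run_test(int fail_before_load, int fail_load)
{
	struct demo_skel *skel = NULL;	/* initialized so cleanup can test it */
	int ret = -1;

	if (fail_before_load)
		goto cleanup;		/* skel is still NULL here */

	skel = demo_skel__open_and_load(fail_load);
	if (!skel)
		goto cleanup;
	ret = 0;

cleanup:
	if (skel)			/* guard: destroy only a loaded skeleton */
		demo_skel__destroy(skel);
	return ret;
}
```

Initializing the pointer to NULL is what makes the guard meaningful. Without it, the early goto would hand an indeterminate value to the cleanup code.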
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26429228214
* Re: [RFC PATCH bpf-next v7 06/11] mm: memcontrol: Add BPF struct_ops for memory controller
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 06/11] mm: memcontrol: Add BPF struct_ops for memory controller Hui Zhu
@ 2026-05-26 3:19 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-05-26 3:19 UTC (permalink / raw)
To: hui.zhu, ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, inwardvessel, akpm,
shuah, davem, kuba, hawk, sdf, kpsingh, chen.dylane, yatsenko,
leon.hwang, a.s.protopopov, ameryhung, tklauser, eyal.birger,
rongtao, haoluo, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny,
kernel, brauner, rdunlap, brgerst, masahiroy, willemb,
kerneljasonxing, paul.chaignon, chenridong, lance.yang, jiayu
Cc: geliang, baohua, zhuhui, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 4805 bytes --]
> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> index 716df49d7647..1f726a7b22e3 100644
> --- a/mm/bpf_memcontrol.c
> +++ b/mm/bpf_memcontrol.c
[ ... ]
> +/* Unregister the struct ops instance */
> +static void bpf_memcg_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *ops_link;
> + struct memcg_bpf_ops *ops = kdata;
> + struct cgroup_subsys_state *css;
> + struct mem_cgroup *memcg;
> + struct mem_cgroup *iter;
> + struct memcg_bpf_ops *parent_bpf_ops = NULL;
> + u32 parent_bpf_ops_flags = 0;
> +
> + if (!link)
> + return;
> + ops_link = container_of(link, struct bpf_struct_ops_link, link);
> + if (!ops_link->cgroup)
> + return;
Can this early return skip a required synchronize_srcu(&memcg_bpf_srcu)?
When a cgroup that owns a memcg_bpf_ops link is destroyed,
cgroup_bpf_release() takes cgroup_mutex, splices the struct_ops_links,
sets st_link->cgroup = NULL, drops cgroup_mutex, and then calls
link->ops->detach() outside the mutex. detach()
(bpf_struct_ops_map_link_detach) invokes unreg() with link->cgroup
already NULL, which hits this early return:
    if (!ops_link->cgroup)
        return;
At that point descendants of the destroyed cgroup still hold
memcg->bpf_ops pointing at the registered ops (kdata). Memcgs in that
subtree are not freed immediately because folios pin them via
folio->memcg_data after offline, and uncharge paths
(memcg_uncharge() -> bpf_memcg_uncharged()) keep dereferencing those
memcgs.
bpf_memcg_uncharged()/bpf_memcg_charged() use the BPF_MEMCG_CALL macros,
which take srcu_read_lock(&memcg_bpf_srcu), READ_ONCE(memcg->bpf_ops),
and then dereference __ops->op. After unreg() returns, the
bpf_struct_ops infrastructure eventually frees kdata via
bpf_struct_ops_map_free(), which calls
synchronize_rcu_mult(call_rcu, call_rcu_tasks) and waits only for
classic RCU and RCU-tasks grace periods, not SRCU.
If the unreg() path returns without calling
synchronize_srcu(&memcg_bpf_srcu), can an SRCU reader holding a pointer
into the about-to-be-freed kdata observe a use-after-free when invoking
__ops->op()?
Would moving the synchronize_srcu(&memcg_bpf_srcu) call so it runs on
both early-return paths (or unconditionally before returning) be
appropriate here? The other early return (link == NULL) looks benign
because reg() returns -EOPNOTSUPP for link == NULL and never writes any
memcg->bpf_ops, so no synchronize is required on that path.
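The control flow being asked about can be modelled in userspace. This is only a sketch: struct link_model, synchronize_srcu_model(), reset_subtree_model() and unreg_model() are stand-ins invented for illustration, and the model does not address whether descendants' bpf_ops pointers still need to be cleared on the released-cgroup path.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins; none of these are the kernel's types or functions. */
struct link_model {
	void *cgroup;	/* NULL once the owning cgroup has been released */
};

static int srcu_sync_calls;	/* how many grace periods were waited for */
static int subtree_resets;	/* how many times the subtree walk ran */

static void synchronize_srcu_model(void)
{
	srcu_sync_calls++;
}

static void reset_subtree_model(void)
{
	subtree_resets++;
}

static void unreg_model(struct link_model *link)
{
	/* reg() never published ops for a NULL link, so there is nothing to wait for. */
	if (!link)
		return;

	/* Walk the subtree only while the cgroup is still attached. */
	if (link->cgroup)
		reset_subtree_model();

	/* Runs on both remaining paths, so readers drain before kdata can go away. */
	synchronize_srcu_model();
}
```

The point of the shape is that the grace-period wait no longer depends on whether link->cgroup was already cleared by the release path.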
> + cgroup_lock();
> +
> + css = cgroup_e_css(ops_link->cgroup, &memory_cgrp_subsys);
> + if (!css)
> + goto unlock_out;
> + memcg = mem_cgroup_from_css(css);
> +
> + /* Get the parent bpf_ops and bpf_ops_flags */
> + iter = parent_mem_cgroup(memcg);
> + if (iter) {
> + parent_bpf_ops = READ_ONCE(iter->bpf_ops);
> + parent_bpf_ops_flags = iter->bpf_ops_flags;
> + }
> +
> + iter = NULL;
> + while ((iter = mem_cgroup_iter(memcg, iter, NULL))) {
> + if (READ_ONCE(iter->bpf_ops) == ops) {
> + WRITE_ONCE(iter->bpf_ops, parent_bpf_ops);
> + iter->bpf_ops_flags = parent_bpf_ops_flags;
> + }
> + }
> +
> +unlock_out:
> + cgroup_unlock();
> + synchronize_srcu(&memcg_bpf_srcu);
> +}
[ ... ]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 749c128b4fad..3494f36d792d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[ ... ]
> @@ -2516,11 +2522,15 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
> * memory.high is breached and reclaim is unable to keep up. Throttle
> * allocators proactively to slow down excessive growth.
> */
> - penalty_jiffies = calculate_high_delay(memcg, nr_pages,
> - mem_find_max_overage(memcg));
> + if (nr_pages) {
> + penalty_jiffies = calculate_high_delay(
> + memcg, nr_pages, mem_find_max_overage(memcg));
>
> - penalty_jiffies += calculate_high_delay(memcg, nr_pages,
> - swap_find_max_overage(memcg));
> + penalty_jiffies += calculate_high_delay(
> + memcg, nr_pages, swap_find_max_overage(memcg));
> + } else
> + penalty_jiffies = 0;
> + penalty_jiffies = max(penalty_jiffies, bpf_high_delay);
This isn't a bug, but the if/else here mixes braced and unbraced
branches. Documentation/process/coding-style.rst (Placing Braces and
Spaces) asks that when one arm of an if/else uses braces, the other arm
use them too. Would something like the following be preferable, and
quiet checkpatch.pl?
    if (nr_pages) {
        penalty_jiffies = calculate_high_delay(
            memcg, nr_pages, mem_find_max_overage(memcg));
        penalty_jiffies += calculate_high_delay(
            memcg, nr_pages, swap_find_max_overage(memcg));
    } else {
        penalty_jiffies = 0;
    }
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26429228214
* Re: [RFC PATCH bpf-next v7 02/11] bpf: allow attaching struct_ops to cgroups
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 02/11] bpf: allow attaching struct_ops to cgroups Hui Zhu
@ 2026-05-26 3:19 ` bot+bpf-ci
0 siblings, 0 replies; 18+ messages in thread
From: bot+bpf-ci @ 2026-05-26 3:19 UTC (permalink / raw)
To: hui.zhu, ast, daniel, john.fastabend, andrii, martin.lau, eddyz87,
memxor, song, yonghong.song, jolsa, hannes, mhocko,
roman.gushchin, shakeel.butt, muchun.song, inwardvessel, akpm,
shuah, davem, kuba, hawk, sdf, kpsingh, chen.dylane, yatsenko,
leon.hwang, a.s.protopopov, ameryhung, tklauser, eyal.birger,
rongtao, haoluo, peterz, ojeda, nathan, kees, tj, jeffxu, mkoutny,
kernel, brauner, rdunlap, brgerst, masahiroy, willemb,
kerneljasonxing, paul.chaignon, chenridong, lance.yang, jiayu
Cc: geliang, baohua, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 10195 bytes --]
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index dd4f34e041ab..bd1b94a618ad 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1922,6 +1922,9 @@ struct bpf_raw_tp_link {
> struct bpf_struct_ops_link {
> struct bpf_link link;
> struct bpf_map __rcu *map;
> + struct cgroup *cgroup;
> + bool cgroup_removed;
> + struct list_head list;
> wait_queue_head_t wait_hup;
> };
Yafang Shao raised a design concern about struct bpf_struct_ops_link:
the struct hardcodes cgroup-specific fields. He suggested a more
generic solution using 'int type' and 'void *private' fields to support
other structs (task, etc.) in the future. This design feedback was
neither addressed nor answered.
Reference: https://lore.kernel.org/bpf/CALOAHbDXmgi=yeB1c2zzQ7-Lz2+MEZvtbxQp1=mWXMJSsg3sgQ@mail.gmail.com/
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index aec171ccb6ef..f547613986cc 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1246,6 +1246,7 @@ enum bpf_perf_event_type {
> #define BPF_F_AFTER (1U << 4)
> #define BPF_F_ID (1U << 5)
> #define BPF_F_PREORDER (1U << 6)
> +#define BPF_F_CGROUP_FD (1U << 7)
> #define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
Since both fdinfo and link_info show the cgroup ID, would BPF_F_CGROUP_ID
be a better name than BPF_F_CGROUP_FD for alignment?
This naming suggestion came from Yafang Shao and was not addressed or
responded to.
Reference: https://lore.kernel.org/bpf/CALOAHbDXmgi=yeB1c2zzQ7-Lz2+MEZvtbxQp1=mWXMJSsg3sgQ@mail.gmail.com/
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 7ad3b1a49dee..e63f926d8728 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -1264,6 +1270,7 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
> {
> struct bpf_struct_ops_link *st_link;
> struct bpf_map *map;
> + u64 cgrp_id = 0;
Assigning 0 to cgrp_id would technically be incorrect. The cgroup_id()
function returns 1 for !CONFIG_CGROUPS, and for CONFIG_CGROUPS the ID
allocation via idr_alloc_cyclic() uses a range between 1 and INT_MAX.
Would a sentinel value like (u64)-1 be more appropriate?
Matt Bobrowski raised this concern and Roman acknowledged the review,
but this was not fixed in the current version.
Reference: https://lore.kernel.org/bpf/aXnyKw5sRt_MB-8A@google.com/
> @@ -1271,6 +1278,14 @@ static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
> if (map)
> seq_printf(seq, "map_id:\t%d\n", map->id);
> rcu_read_unlock();
> +
> + cgroup_lock();
> + if (st_link->cgroup)
> + cgrp_id = cgroup_id(st_link->cgroup);
> + cgroup_unlock();
Could an inline helper be introduced for the cgroup_lock()/cgroup_id()/
cgroup_unlock() pattern? This same dance appears in both
bpf_struct_ops_map_link_show_fdinfo() and
bpf_struct_ops_map_link_fill_link_info().
Matt Bobrowski suggested this and Roman said he would try, but the
helper was not added in the current version.
Reference: https://lore.kernel.org/bpf/aXnyKw5sRt_MB-8A@google.com/
> @@ -1393,8 +1422,12 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
> struct bpf_link_primer link_primer;
> struct bpf_struct_ops_map *st_map;
> struct bpf_map *map;
> + struct cgroup *cgrp;
> int err;
>
> + if (attr->link_create.flags & ~BPF_F_CGROUP_FD)
> + return -EINVAL;
Should all the cgroup-related logic be guarded by a CONFIG_CGROUPS ifdef?
Neither cgroup_get_from_fd() nor cgroup_put() has a stub when building
with !CONFIG_CGROUPS. Should BPF_F_CGROUP_FD only be accepted when
CONFIG_BPF_CGROUP is enabled, returning -EOPNOTSUPP otherwise?
Matt Bobrowski raised this and Roman acknowledged the review, but these
guards were not added in the current version.
Reference: https://lore.kernel.org/bpf/aXnyKw5sRt_MB-8A@google.com/
> @@ -1414,11 +1447,26 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
> attr->link_create.attach_type);
>
> + init_waitqueue_head(&link->wait_hup);
> +
> + if (attr->link_create.flags & BPF_F_CGROUP_FD) {
> + cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
> + if (IS_ERR(cgrp)) {
> + err = PTR_ERR(cgrp);
> + goto err_out;
> + }
> + link->cgroup = cgrp;
> + err = cgroup_bpf_attach_struct_ops(cgrp, link);
> + if (err) {
> + cgroup_put(cgrp);
> + link->cgroup = NULL;
> + goto err_out;
> + }
> + }
> +
> err = bpf_link_prime(&link->link, &link_primer);
> if (err)
> - goto err_out;
> -
> - init_waitqueue_head(&link->wait_hup);
> + goto err_put_cgroup;
>
> /* Hold the update_mutex such that the subsystem cannot
> * do link->ops->detach() before the link is fully initialized.
> @@ -1429,13 +1477,16 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
> mutex_unlock(&update_mutex);
> bpf_link_cleanup(&link_primer);
> link = NULL;
> - goto err_out;
> + goto err_put_cgroup;
> }
> RCU_INIT_POINTER(link->map, map);
> mutex_unlock(&update_mutex);
>
> return bpf_link_settle(&link_primer);
>
> +err_put_cgroup:
> + if (link && link->cgroup)
> + cgroup_bpf_detach_struct_ops(link->cgroup, link);
> err_out:
> bpf_map_put(map);
> kfree(link);
Can the error path here trigger a use-after-free?
After cgroup_bpf_attach_struct_ops() succeeds, the link is added to
cgrp->bpf.struct_ops_links and becomes reachable by cgroup_bpf_release().
If bpf_link_prime() then fails (FD exhaustion or memory pressure), the
error path checks link->cgroup and calls cgroup_bpf_detach_struct_ops().
However, if the cgroup is concurrently destroyed, cgroup_bpf_release()
can run:
    cgroup_bpf_release():
        cgroup_lock()
        list_splice_init(&cgrp->bpf.struct_ops_links, &st_links)
        for each st_link in st_links:
            st_link->cgroup = NULL
            st_link->cgroup_removed = true
            bpf_link_inc_not_zero(&st_link->link)  // refcount 1->2
        cgroup_unlock()
        ...
        st_link->link.ops->detach(&st_link->link)
        bpf_link_put(&st_link->link)
At this point link->cgroup is NULL, so the err_put_cgroup check becomes
a no-op. Control proceeds to err_out and calls kfree(link), freeing the
memory.
Meanwhile, cgroup_bpf_release resumes its second loop and accesses the
already-freed link memory via st_link->link.ops->detach() and
bpf_link_put().
The error path uses kfree(link) directly, bypassing the bpf_link
refcount mechanism. Once the link is added to cgrp->bpf.struct_ops_links,
it becomes exposed to concurrent cgroup_bpf_release which can bump the
refcount. Direct kfree() ignores the additional reference and frees the
memory while the release work still holds a raw pointer.
In contrast, when the post-bpf_link_prime reg() failure occurs, the code
uses bpf_link_cleanup() and sets link = NULL, which is safe because it
relies on the refcount mechanism.
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index 876f6a81a9b6..b593ebb30a4e 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -307,12 +308,23 @@ static void cgroup_bpf_release(struct work_struct *work)
> bpf.release_work);
> struct bpf_prog_array *old_array;
> struct list_head *storages = &cgrp->bpf.storages;
> + struct bpf_struct_ops_link *st_link, *st_tmp;
> struct bpf_cgroup_storage *storage, *stmp;
> + LIST_HEAD(st_links);
>
> unsigned int atype;
>
> cgroup_lock();
>
> + list_splice_init(&cgrp->bpf.struct_ops_links, &st_links);
> + list_for_each_entry_safe(st_link, st_tmp, &st_links, list) {
> + st_link->cgroup = NULL;
> + st_link->cgroup_removed = true;
> + cgroup_put(cgrp);
> + if (IS_ERR(bpf_link_inc_not_zero(&st_link->link)))
> + list_del(&st_link->list);
> + }
Does this loop have the correct ordering for concurrent link dealloc?
The first loop writes to st_link memory (st_link->cgroup = NULL,
st_link->cgroup_removed = true) before calling bpf_link_inc_not_zero()
to take a reference.
Meanwhile, if userspace closes the link FD and the refcount drops to 0,
bpf_struct_ops_map_link_dealloc() runs:
    bpf_struct_ops_map_link_dealloc():
        st_map->st_ops_desc->st_ops->unreg(&st_map->kvalue.data, link);
        if (st_link->cgroup)
            cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
        kfree(st_link);
The dealloc path reads st_link->cgroup without holding cgroup_mutex.
If this read sees the NULL written by cgroup_bpf_release, dealloc skips
cgroup_bpf_detach_struct_ops() and proceeds to kfree(st_link).
At that point, cgroup_bpf_release continues and its remaining accesses
hit the freed st_link memory:
    st_link->cgroup_removed = true          // UAF write
    bpf_link_inc_not_zero(&st_link->link)   // UAF atomic on freed memory
    list_del(&st_link->list)                // UAF / list corruption
The link is on cgrp->bpf.struct_ops_links but list membership does not
increment the link refcount. The cgroup_lock would only synchronize the
dealloc path if dealloc unconditionally took it, but dealloc takes
cgroup_lock conditionally on (st_link->cgroup), which the racing writer
just nulled.
Would it be safer to call bpf_link_inc_not_zero() first before touching
st_link's other fields?
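The ordering suggested in the question, pin first and mutate afterwards, can be sketched with C11 atomics. struct link_model, ref_inc_not_zero() and release_one() are simplified models invented for illustration, not the kernel's bpf_link code. The sketch shows only the ordering. It does not by itself guarantee that the memory is still valid when the increment is attempted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a refcounted link; not struct bpf_struct_ops_link. */
struct link_model {
	atomic_int refcnt;
	void *cgroup;
	bool cgroup_removed;
};

/* Take a reference only if the count has not already dropped to zero. */
static bool ref_inc_not_zero(atomic_int *ref)
{
	int old = atomic_load(ref);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Pin first, then mutate; returns false if the link is already dying. */
static bool release_one(struct link_model *link)
{
	if (!ref_inc_not_zero(&link->refcnt))
		return false;	/* leave a dying object's fields untouched */

	link->cgroup = NULL;
	link->cgroup_removed = true;
	return true;
}
```

In this model a link whose count has reached zero is never written to, which is the property the reordering is meant to provide.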
> @@ -2755,3 +2773,31 @@ cgroup_common_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +void cgroup_bpf_detach_struct_ops(struct cgroup *cgrp,
> + struct bpf_struct_ops_link *link)
> +{
> + cgroup_lock();
> + if (link->cgroup == cgrp) {
> + list_del(&link->list);
> + link->cgroup = NULL;
> + cgroup_put(cgrp);
> + }
> + cgroup_unlock();
> +}
Is the cgrp parameter needed in cgroup_bpf_attach_struct_ops() and
cgroup_bpf_detach_struct_ops()? Both Matt Bobrowski and Josh Don
suggested that the two functions operate on link->cgroup instead, with
a NULL link->cgroup treated as -EINVAL. Roman acknowledged with 'Ack'
and 'Sure, good point' but the parameter was not removed.
Reference: https://lore.kernel.org/bpf/aXnyKw5sRt_MB-8A@google.com/
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26429228214
end of thread, other threads:[~2026-05-26 3:19 UTC | newest]
Thread overview: 18+ messages
2026-05-26 2:20 [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim Hui Zhu
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 01/11] bpf: move bpf_struct_ops_link into bpf.h Hui Zhu
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 02/11] bpf: allow attaching struct_ops to cgroups Hui Zhu
2026-05-26 3:19 ` bot+bpf-ci
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 03/11] libbpf: fix return value on memory allocation failure Hui Zhu
2026-05-26 3:06 ` bot+bpf-ci
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 04/11] libbpf: introduce bpf_map__attach_struct_ops_opts() Hui Zhu
2026-05-26 3:06 ` bot+bpf-ci
2026-05-26 2:20 ` [RFC PATCH bpf-next v7 05/11] bpf: Pass flags in bpf_link_create for struct_ops Hui Zhu
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 06/11] mm: memcontrol: Add BPF struct_ops for memory controller Hui Zhu
2026-05-26 3:19 ` bot+bpf-ci
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 07/11] mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc Hui Zhu
2026-05-26 3:06 ` bot+bpf-ci
2026-05-26 2:24 ` [RFC PATCH bpf-next v7 08/11] selftests/bpf: Add tests for memcg_bpf_ops Hui Zhu
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 09/11] selftests/bpf: Add test for memcg_bpf_ops hierarchies Hui Zhu
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 10/11] selftests/bpf: Add selftest for memcg async reclaim via BPF Hui Zhu
2026-05-26 3:06 ` bot+bpf-ci
2026-05-26 2:27 ` [RFC PATCH bpf-next v7 11/11] samples/bpf: Add memcg priority control and async reclaim example Hui Zhu