Linux cgroups development
From: Hui Zhu <hui.zhu@linux.dev>
To: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	Jiri Olsa <jolsa@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	JP Kobryn <inwardvessel@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shuah Khan <shuah@kernel.org>,
	davem@davemloft.net, Jakub Kicinski <kuba@kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>,
	KP Singh <kpsingh@kernel.org>, Tao Chen <chen.dylane@linux.dev>,
	Mykyta Yatsenko <yatsenko@meta.com>,
	Leon Hwang <leon.hwang@linux.dev>,
	Anton Protopopov <a.s.protopopov@gmail.com>,
	Amery Hung <ameryhung@gmail.com>,
	Tobias Klauser <tklauser@distanz.ch>,
	Eyal Birger <eyal.birger@gmail.com>, Rong Tao <rongtao@cestc.cn>,
	Hao Luo <haoluo@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Miguel Ojeda <ojeda@kernel.org>,
	Nathan Chancellor <nathan@kernel.org>,
	Kees Cook <kees@kernel.org>, Tejun Heo <tj@kernel.org>,
	Jeff Xu <jeffxu@chromium.org>,
	mkoutny@suse.com, Jan Hendrik Farr <kernel@jfarr.cc>,
	Christian Brauner <brauner@kernel.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	Brian Gerst <brgerst@gmail.com>,
	Masahiro Yamada <masahiroy@kernel.org>,
	Willem de Bruijn <willemb@google.com>,
	Jason Xing <kerneljasonxing@gmail.com>,
	Paul Chaignon <paul.chaignon@gmail.com>,
	Chen Ridong <chenridong@huaweicloud.com>,
	Lance Yang <lance.yang@linux.dev>,
	Jiayuan Chen <jiayuan.chen@linux.dev>,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	netdev@vger.kernel.org, linux-kselftest@vger.kernel.org
Cc: geliang@kernel.org, baohua@kernel.org, Hui Zhu <zhuhui@kylinos.cn>
Subject: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim
Date: Tue, 26 May 2026 10:20:00 +0800
Message-ID: <cover.1779760876.git.zhuhui@kylinos.cn>

From: Hui Zhu <zhuhui@kylinos.cn>

Overview:
This series introduces BPF struct_ops support for the memory controller,
enabling userspace BPF programs to implement custom, dynamic memory
management policies per cgroup. The feature allows BPF programs to hook
into the core reclaim and charge paths without requiring kernel
modifications, providing a flexible alternative to static knobs such as
memory.low and memory.min.
 
The series enables two complementary use cases.
 
Dynamic memory protection: static memory protection thresholds
(memory.low, memory.min) are poor fits for workloads whose actual memory
activity varies over time. A high-priority cgroup that holds a large
working set but is temporarily idle remains shielded from reclaim, which
pushes reclaim pressure onto its siblings and wastes memory that could be
put to use. A BPF-driven approach can observe real workload
activity -- page faults, charge/uncharge events -- and activate or
withdraw protection dynamically. The test results at the end of this
letter quantify the difference: in a scenario where the high-priority
cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
37x higher throughput than with static memory.low.
 
Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
hooks, combined with the BPF workqueue mechanism and the new
bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
proactive background reclaim without blocking the charge path. The
pattern works as follows: the memcg_charged callback tracks accumulated
memory usage; when usage crosses a configurable threshold, it enqueues an
asynchronous work item via bpf_wq_start() and returns immediately without
throttling the charging task. The workqueue callback then invokes
bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
cgroup; if usage remains elevated after reclaim, the callback re-enqueues
itself to continue. This allows a BPF program to keep a cgroup's
footprint below its hard limit (memory.max) entirely in the background,
avoiding the OOM killer or direct-reclaim stalls that would otherwise
occur. The selftest for this feature (patch 10/11) validates the
mechanism concretely: a workload that writes and mmaps a 64 MB file inside
a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
with the async reclaim program attached, the "max" counter does not
increase at all across the same workload.
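
The control flow of the async reclaim pattern can be modelled in plain
userspace C. The sketch below is only a simulation under stated
assumptions: it does not use the BPF API, and every name in it
(struct sim_memcg, sim_charge, sim_reclaim_work, sim_run_pending, the
page counts) is hypothetical. sim_charge stands in for memcg_charged,
setting work_pending stands in for bpf_wq_start(), and sim_reclaim_work
stands in for a workqueue callback that would call
bpf_try_to_free_mem_cgroup_pages().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one cgroup; not a kernel structure. */
struct sim_memcg {
	size_t usage;           /* pages currently charged */
	size_t threshold;       /* background reclaim starts above this */
	size_t reclaim_batch;   /* pages one work item tries to free */
	bool work_pending;      /* models a queued workqueue item */
	unsigned int work_runs; /* how many times the work item ran */
};

/*
 * Models the memcg_charged callback: account the batch, queue work if
 * the threshold is crossed, and return a zero delay so the charging
 * task is never throttled.
 */
static unsigned int sim_charge(struct sim_memcg *cg, size_t batch)
{
	cg->usage += batch;
	if (cg->usage > cg->threshold)
		cg->work_pending = true; /* stands in for bpf_wq_start() */
	return 0;
}

/* Models the memcg_uncharged callback: track released pages. */
static void sim_uncharge(struct sim_memcg *cg, size_t batch)
{
	cg->usage -= batch < cg->usage ? batch : cg->usage;
}

/*
 * Models the workqueue callback: free up to one batch, then re-enqueue
 * itself if usage is still above the threshold.
 */
static void sim_reclaim_work(struct sim_memcg *cg)
{
	size_t freed = cg->reclaim_batch < cg->usage ?
		       cg->reclaim_batch : cg->usage;

	cg->work_pending = false;
	cg->work_runs++;
	sim_uncharge(cg, freed);
	if (cg->usage > cg->threshold)
		cg->work_pending = true;
}

/* Drains queued work, as a background worker eventually would. */
static void sim_run_pending(struct sim_memcg *cg)
{
	assert(cg->reclaim_batch > 0); /* otherwise the loop cannot finish */
	while (cg->work_pending)
		sim_reclaim_work(cg);
}
```

With a threshold of 100 pages and a batch of 32, charging 90 and then 60
pages queues work on the second charge only, and draining the queue takes
two passes (150 -> 118 -> 86) before usage falls below the threshold.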
 
This series incorporates part of Roman Gushchin's series [1] so that it
builds cleanly on top of bpf-next.
 
Patch Breakdown:
Patches 1-4 are from Roman Gushchin's series [1], included here to
provide the necessary BPF infrastructure for attaching struct_ops to
cgroups.
 
Patches 5-11 are the new work in this series:
 
  05/11  bpf: Pass flags in bpf_link_create for struct_ops
         Stores attr->link_create.flags in struct bpf_struct_ops_link
         and extends the validation to allow BPF_F_ALLOW_OVERRIDE.
         Also updates the UAPI comment to reflect that cgroup-bpf attach
         flags now apply to BPF_LINK_CREATE in addition to
         BPF_PROG_ATTACH.
 
  06/11  mm: memcontrol: Add BPF struct_ops for memory controller
         The core feature patch. Introduces the memcg_bpf_ops struct_ops
         type with the following hooks:
 
         - memcg_charged(memcg, batch): called on the synchronous charge
           path. Returns a throttling delay in milliseconds; used as a
           lower bound for __mem_cgroup_handle_over_high(), effective
           even when memory.high is not breached.
 
         - memcg_uncharged(memcg, batch): called on uncharge, allowing
           BPF programs to track memory releases.
 
         - below_low(memcg, elow, usage): overrides the memory.low
           protection check. Returns true to treat the cgroup as
           protected regardless of the elow >= usage comparison.
 
         - below_min(memcg, emin, usage): same as below_low but for
           memory.min protection.
 
         - handle_cgroup_online/offline(memcg): lifecycle callbacks for
           per-cgroup state management in BPF programs.
 
         BPF_F_ALLOW_OVERRIDE is supported: when a program is registered
         with this flag, descendant cgroups may attach their own
         memcg_bpf_ops to override the inherited policy. Registration
         propagates ops down through the subtree via mem_cgroup_iter;
         unregistration restores each descendant to the ops its
         registering ancestor's parent held, correctly preserving
         override chains.
 
  07/11  mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
         Exposes try_to_free_mem_cgroup_pages() to BPF programs as a
         KF_SLEEPABLE kfunc. A swappiness parameter controls the
         override value passed to the core reclaim path
         (effective only when MEMCG_RECLAIM_PROACTIVE is set in
         reclaim_options).
 
  08/11  selftests/bpf: Add tests for memcg_bpf_ops
         Adds prog_tests/memcg_ops.c covering three scenarios:
         memcg_charged-only throttling, below_low + memcg_charged
         interaction, and below_min + memcg_charged interaction. A
         tracepoint on memcg:count_memcg_events (PGFAULT) is used to
         detect memory pressure and trigger hooks accordingly.
 
  09/11  selftests/bpf: Add test for memcg_bpf_ops hierarchies
         Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
         three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
         root, override at the middle level without the flag, then assert
         that attaching to the leaf correctly fails with -EBUSY.
 
  10/11  selftests/bpf: Add selftest for memcg async reclaim via BPF
         Demonstrates and validates asynchronous memory reclaim: a BPF
         program uses the memcg_charged/memcg_uncharged hooks to track
         accumulated usage and, when a threshold is exceeded, enqueues a
         bpf_wq_start() workqueue item that calls
         bpf_try_to_free_mem_cgroup_pages() without blocking the charge
         path. The test asserts that with the BPF program active,
         memory.events "max" events do not increase under a workload
         that would otherwise exceed the hard limit.
 
  11/11  samples/bpf: Add memcg priority control and async reclaim example
         Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
         demonstrating both features. The BPF side monitors PGFAULT
         events on a high-priority cgroup; when the per-second fault
         count crosses a configurable threshold, it activates below_low
         or below_min protection for the high-priority cgroup and/or
         applies a charge delay to the low-priority cgroup. Six
         struct_ops variants are exported so userspace can attach only
         the hooks needed. Async reclaim is optionally combined with
         priority throttling via a shared low-cgroup ops map.
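
The hook semantics from 06/11 and the attach rule checked in 09/11 can be
summarised with a small userspace model. This is a sketch, not kernel
code: all names (sim_protect_fn, sim_is_protected, sim_charge_delay_ms,
struct sim_cgroup, sim_attach) are hypothetical. It assumes that "lower
bound" means the larger of the BPF delay and the over-high delay is used,
that the nearest attached ancestor decides whether a descendant may
attach, and that a cgroup with its own ops rejects a second attach.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a below_low/below_min hook. */
typedef bool (*sim_protect_fn)(unsigned long effective, unsigned long usage);

/* Example hook that always asks for protection. */
static bool sim_always_protect(unsigned long effective, unsigned long usage)
{
	(void)effective;
	(void)usage;
	return true;
}

/* Protected if the hook says so; otherwise the static comparison applies. */
static bool sim_is_protected(sim_protect_fn hook, unsigned long effective,
			     unsigned long usage)
{
	if (hook && hook(effective, usage))
		return true;
	return effective >= usage;
}

/* Assumed reading of "lower bound": the larger of the two delays wins. */
static unsigned int sim_charge_delay_ms(unsigned int bpf_delay_ms,
					unsigned int over_high_delay_ms)
{
	return bpf_delay_ms > over_high_delay_ms ?
	       bpf_delay_ms : over_high_delay_ms;
}

/* Hypothetical model of one cgroup in the hierarchy test. */
struct sim_cgroup {
	struct sim_cgroup *parent;
	bool attached;       /* this cgroup has its own ops attached */
	bool allow_override; /* attached with BPF_F_ALLOW_OVERRIDE */
};

/* Returns 0 on success or -EBUSY when the attach is not permitted. */
static int sim_attach(struct sim_cgroup *cg, bool allow_override)
{
	const struct sim_cgroup *anc;

	assert(cg != NULL);
	if (cg->attached)
		return -EBUSY; /* assumption of this model */

	/* The nearest attached ancestor decides. */
	for (anc = cg->parent; anc; anc = anc->parent) {
		if (!anc->attached)
			continue;
		if (!anc->allow_override)
			return -EBUSY;
		break;
	}

	cg->attached = true;
	cg->allow_override = allow_override;
	return 0;
}
```

In this model, attaching at the root with allow_override set and then at
the middle level without it both succeed, while a later attach at the
leaf returns -EBUSY, which is the sequence the 09/11 selftest describes.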
 
Test Environment:
The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using
a tmpfs-backed file on the host as a swap device to reduce I/O impact.
Two cgroups are created -- high (high-priority) and low (low-priority)
-- and each test runs two concurrent stress-ng workloads, one per
cgroup, each requesting 3 GB of memory.
 
  # mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
  # free -h
                 total   used    free  shared  buff/cache  available
  Mem:           1.9Gi  317Mi  1.6Gi   1.0Mi       144Mi      1.6Gi
  Swap:          4.0Gi     0B  4.0Gi
 
Baseline: no memory priority policy:
Both cgroups run without any reclaim protection. Results are roughly
equal, as expected:
 
  cgroup    bogo ops/s
  high           4,979
  low            4,927
 
Test 1: memory.low protection:
Setting memory.low on the high-priority cgroup protects it from
reclaim, at the cost of pushing reclaim pressure onto the low-priority
cgroup:
 
  # echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
 
  cgroup    bogo ops/s
  high         450,290
  low           11,307
 
The high-priority cgroup benefits significantly, but memory.low relies
on static usage thresholds and cannot adapt to actual workload
behavior.
 
Test 2: memory.low with an idle high-priority task:
Here the high-priority cgroup runs a Python script that allocates 3 GB
and then sleeps, simulating a low-activity but memory-holding workload.
Because the process is idle, it generates no page faults and does not
actively use its memory. Yet memory.low still protects it, continuing
to suppress the low-priority cgroup's performance:
 
  cgroup    bogo ops/s
  low           14,757
 
The low-priority cgroup remains significantly throttled despite the
high-priority cgroup being effectively idle -- a clear limitation of
static memory.low control.
 
Test 3: memcg eBPF -- dynamic priority control:
memcg is a sample program introduced in this patch series
(samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
monitors PGFAULT events in the high-priority cgroup. When the
per-second fault count exceeds a configured threshold, the hook
activates below_min protection for one second; otherwise the cgroup
receives no special treatment.
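
The per-second threshold logic described above can be sketched in
userspace C as follows. This is an illustrative model, not the code in
samples/bpf/memcg.bpf.c: the structure and function names are
hypothetical, timestamps are passed in by the caller, and "exceeds" is
taken to mean a strictly greater fault count within a one-second window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SIM_NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-cgroup state for the fault-rate check. */
struct sim_fault_window {
	uint64_t window_start_ns;  /* start of the current 1 s window */
	uint64_t faults;           /* faults seen in the current window */
	uint64_t threshold;        /* configured per-second threshold */
	uint64_t protect_until_ns; /* protection is active before this time */
};

/* Called for each PGFAULT event batch observed in the cgroup. */
static void sim_on_pgfault(struct sim_fault_window *w, uint64_t now_ns,
			   uint64_t count)
{
	assert(now_ns >= w->window_start_ns); /* clock must be monotonic */

	/* Start a new window once a full second has elapsed. */
	if (now_ns - w->window_start_ns >= SIM_NSEC_PER_SEC) {
		w->window_start_ns = now_ns;
		w->faults = 0;
	}

	w->faults += count;

	/* Crossing the threshold arms protection for one second. */
	if (w->faults > w->threshold)
		w->protect_until_ns = now_ns + SIM_NSEC_PER_SEC;
}

/* What a below_low/below_min style hook would consult. */
static bool sim_need_protection(const struct sim_fault_window *w,
				uint64_t now_ns)
{
	return now_ns < w->protect_until_ns;
}
```

With a threshold of 1, a single fault in a window leaves protection off,
a second fault in the same window turns it on for one second, and an
idle cgroup that generates no faults never arms it.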
 
  # ./memcg --low_path=/sys/fs/cgroup/low  \
            --high_path=/sys/fs/cgroup/high \
            --threshold=1 --use_below_min
  Successfully attached!
 
3a. Both cgroups under active memory pressure:
 
When both cgroups run stress-ng, the high-priority cgroup generates
frequent page faults and the BPF hook activates protection, matching
the behavior of memory.low:
 
  cgroup    bogo ops/s
  high         404,392
  low           11,404
 
3b. High-priority cgroup is idle (Python + sleep):
 
Because the sleeping Python process generates no page faults, the BPF
hook never activates, so the idle cgroup's memory is reclaimed normally
and the low-priority cgroup is no longer starved:
 
  cgroup    bogo ops/s
  low          551,083
 
This is a ~37x improvement over the equivalent memory.low scenario
(Test 2), demonstrating that eBPF-driven dynamic control can
accurately reflect actual workload activity and avoid unnecessary
protection of idle high-priority tasks.
 
Summary:
  Scenario                          low-cgroup bogo ops/s
  Baseline (no policy)                           ~4,927
  memory.low, both active                       ~11,307
  memory.low, high idle                         ~14,757
  memcg eBPF, both active                       ~11,404
  memcg eBPF, high idle                        ~551,083
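
The ratios quoted in this letter follow directly from the table. A
minimal helper to recompute them (the function name is hypothetical; the
inputs are the bogo ops/s figures reported above):

```c
#include <assert.h>

/* Ratio of two throughput figures; base_ops must be positive. */
static double sim_speedup(double new_ops, double base_ops)
{
	assert(base_ops > 0.0);
	return new_ops / base_ops;
}
```

For the idle case, sim_speedup(551083, 14757) is about 37.3, which is the
~37x figure. For the active case, sim_speedup(404392, 450290) is about
0.90, so the high-priority cgroup keeps roughly 90% of the throughput it
gets with static memory.low.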
 
References:
[1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/

Changelog:
v7:
- Update the base commits from the "mm: BPF OOM" series to v3.
- Fix issues reported by bpf-ci.
- Rename the get_high_delay_ms hook to memcg_charged; add a
  memcg_uncharged hook for tracking uncharge events.
- Update the below_low and below_min hooks to receive elow/emin and
  usage as explicit arguments.
- Add the bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup
  reclaim to BPF programs.
- Add a selftest for BPF-driven asynchronous page reclaim.
- Extend samples/bpf/memcg to support async reclaim in addition to
  priority throttling.

v6:
Fixed the following issues based on the bot+bpf-ci comments:
- Added a fast-path check with unlikely() before SRCU lock acquisition
  to optimize the no-BPF case in BPF_MEMCG_CALL.
- Added the missing newline to the pr_warn message in
  bpf_memcontrol_init.
- Added comprehensive child process exit status checking with
  WIFEXITED() and WEXITSTATUS(), and added zombie process prevention in
  real_test_memcg_ops.
- Changed malloc() to calloc() for BSS data allocation in all test
  functions and in the sample's main function.
- Changed srcu_read_lock(&memcg_bpf_srcu) to
  lockdep_assert_held(&cgroup_mutex) in memcontrol_bpf_online and
  memcontrol_bpf_offline.

v5:
Fixed the following issues based on the bot+bpf-ci comments:
- Fixed issues in memcg_ops.c and memcg.bpf.c by moving the variable
  declaration to the beginning of the need_threshold() function. The
  'u64 current_ts' variable must be declared before any executable
  statements.
- Improved input validation in samples/bpf/memcg.c by adding a new
  parse_u64() helper function. This function properly handles errors
  from strtoull() and provides better error messages when parsing the
  threshold and over_high_ms command-line arguments.
- Moved the check for prog->sleepable after the member offset
  validation in bpf_memcg_ops_check_member (mm/bpf_memcontrol.c).
- Fixed sscanf return value checking in prog_tests/memcg_ops.c. Changed
  the condition from 'sscanf() < 0' to 'sscanf() != 1' because sscanf
  returns the number of successfully matched items, not a negative
  value on error. This makes the test more reliable when reading timing
  data from temporary files.

v4:
- Fixed issues according to the comments from bot+bpf-ci.
- According to JP Kobryn's comments, moved exit(0) from
  real_test_memcg_ops_child_work to real_test_memcg_ops.
- Fixed issues in the bpf_memcg_ops_reg function.

v3:
- According to the comments from Michal Koutný and Chen Ridong, updated
  the hooks to get_high_delay_ms, below_low, below_min,
  handle_cgroup_online, and handle_cgroup_offline.
- According to Michal Koutný's comments, added BPF_F_ALLOW_OVERRIDE
  support to memcg_bpf_ops.

v2:
- According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
  OOM patch series [1] and added hierarchical delegation support.
- According to the comments from Roman Gushchin and Michal Hocko,
  designed concrete use case scenarios and provided test results.

Hui Zhu (7):
  bpf: Pass flags in bpf_link_create for struct_ops
  mm: memcontrol: Add BPF struct_ops for memory controller
  mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
  selftests/bpf: Add tests for memcg_bpf_ops
  selftests/bpf: Add test for memcg_bpf_ops hierarchies
  selftests/bpf: Add selftest for memcg async reclaim via BPF
  samples/bpf: Add memcg priority control and async reclaim example

Roman Gushchin (4):
  bpf: move bpf_struct_ops_link into bpf.h
  bpf: allow attaching struct_ops to cgroups
  libbpf: fix return value on memory allocation failure
  libbpf: introduce bpf_map__attach_struct_ops_opts()

 MAINTAINERS                                   |   6 +
 include/linux/bpf-cgroup-defs.h               |   3 +
 include/linux/bpf-cgroup.h                    |  16 +
 include/linux/bpf.h                           |  10 +
 include/linux/memcontrol.h                    | 250 ++++++-
 include/uapi/linux/bpf.h                      |   5 +-
 kernel/bpf/bpf_struct_ops.c                   |  67 +-
 kernel/bpf/cgroup.c                           |  46 ++
 mm/bpf_memcontrol.c                           | 355 +++++++++-
 mm/memcontrol.c                               |  43 +-
 samples/bpf/.gitignore                        |   1 +
 samples/bpf/Makefile                          |   8 +-
 samples/bpf/memcg.bpf.c                       | 380 +++++++++++
 samples/bpf/memcg.c                           | 411 ++++++++++++
 tools/include/uapi/linux/bpf.h                |   3 +-
 tools/lib/bpf/libbpf.c                        |  22 +-
 tools/lib/bpf/libbpf.h                        |  14 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/testing/selftests/bpf/cgroup_helpers.c  |  41 ++
 tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
 .../bpf/prog_tests/memcg_async_reclaim.c      | 333 +++++++++
 .../selftests/bpf/prog_tests/memcg_ops.c      | 634 ++++++++++++++++++
 .../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
 tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
 24 files changed, 2952 insertions(+), 34 deletions(-)
 create mode 100644 samples/bpf/memcg.bpf.c
 create mode 100644 samples/bpf/memcg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c

-- 
2.43.0


Thread overview: 18+ messages
2026-05-26  2:20 Hui Zhu [this message]
2026-05-26  2:20 ` [RFC PATCH bpf-next v7 01/11] bpf: move bpf_struct_ops_link into bpf.h Hui Zhu
2026-05-26  2:20 ` [RFC PATCH bpf-next v7 02/11] bpf: allow attaching struct_ops to cgroups Hui Zhu
2026-05-26  3:19   ` bot+bpf-ci
2026-05-26  2:20 ` [RFC PATCH bpf-next v7 03/11] libbpf: fix return value on memory allocation failure Hui Zhu
2026-05-26  3:06   ` bot+bpf-ci
2026-05-26  2:20 ` [RFC PATCH bpf-next v7 04/11] libbpf: introduce bpf_map__attach_struct_ops_opts() Hui Zhu
2026-05-26  3:06   ` bot+bpf-ci
2026-05-26  2:20 ` [RFC PATCH bpf-next v7 05/11] bpf: Pass flags in bpf_link_create for struct_ops Hui Zhu
2026-05-26  2:24 ` [RFC PATCH bpf-next v7 06/11] mm: memcontrol: Add BPF struct_ops for memory controller Hui Zhu
2026-05-26  3:19   ` bot+bpf-ci
2026-05-26  2:24 ` [RFC PATCH bpf-next v7 07/11] mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc Hui Zhu
2026-05-26  3:06   ` bot+bpf-ci
2026-05-26  2:24 ` [RFC PATCH bpf-next v7 08/11] selftests/bpf: Add tests for memcg_bpf_ops Hui Zhu
2026-05-26  2:27 ` [RFC PATCH bpf-next v7 09/11] selftests/bpf: Add test for memcg_bpf_ops hierarchies Hui Zhu
2026-05-26  2:27 ` [RFC PATCH bpf-next v7 10/11] selftests/bpf: Add selftest for memcg async reclaim via BPF Hui Zhu
2026-05-26  3:06   ` bot+bpf-ci
2026-05-26  2:27 ` [RFC PATCH bpf-next v7 11/11] samples/bpf: Add memcg priority control and async reclaim example Hui Zhu
