From: Usama Arif <usama.arif@linux.dev>
To: Hui Zhu <hui.zhu@linux.dev>
Cc: Usama Arif <usama.arif@linux.dev>,
Daniel Borkmann <daniel@iogearbox.net>,
John Fastabend <john.fastabend@gmail.com>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
Jiri Olsa <jolsa@kernel.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
JP Kobryn <inwardvessel@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Shuah Khan <shuah@kernel.org>,
davem@davemloft.net, Jakub Kicinski <kuba@kernel.org>,
Jesper Dangaard Brouer <hawk@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>,
KP Singh <kpsingh@kernel.org>, Tao Chen <chen.dylane@linux.dev>,
Mykyta Yatsenko <yatsenko@meta.com>,
Leon Hwang <leon.hwang@linux.dev>,
Anton Protopopov <a.s.protopopov@gmail.com>,
Amery Hung <ameryhung@gmail.com>,
Tobias Klauser <tklauser@distanz.ch>,
Eyal Birger <eyal.birger@gmail.com>, Rong Tao <rongtao@cestc.cn>,
Hao Luo <haoluo@google.com>,
Peter Zijlstra <peterz@infradead.org>,
Miguel Ojeda <ojeda@kernel.org>,
Nathan Chancellor <nathan@kernel.org>,
Kees Cook <kees@kernel.org>, Tejun Heo <tj@kernel.org>,
Jeff Xu <jeffxu@chromium.org>,
mkoutny@suse.com, Jan Hendrik Farr <kernel@jfarr.cc>,
Christian Brauner <brauner@kernel.org>,
Randy Dunlap <rdunlap@infradead.org>,
Brian Gerst <brgerst@gmail.com>,
Masahiro Yamada <masahiroy@kernel.org>,
Willem de Bruijn <willemb@google.com>,
Jason Xing <kerneljasonxing@gmail.com>,
Paul Chaignon <paul.chaignon@gmail.com>,
Chen Ridong <chenridong@huaweicloud.com>,
Lance Yang <lance.yang@linux.dev>,
Jiayuan Chen <jiayuan.chen@linux.dev>,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
cgroups@vger.kernel.org, linux-mm@kvack.org,
netdev@vger.kernel.org, linux-kselftest@vger.kernel.org,
geliang@kernel.org, baohua@kernel.org,
Hui Zhu <zhuhui@kylinos.cn>
Subject: Re: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim
Date: Tue, 26 May 2026 06:41:02 -0700
Message-ID: <20260526134115.816081-1-usama.arif@linux.dev>
In-Reply-To: <cover.1779760876.git.zhuhui@kylinos.cn>
On Tue, 26 May 2026 10:20:00 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:
> From: Hui Zhu <zhuhui@kylinos.cn>
>
> Overview:
> This series introduces BPF struct_ops support for the memory controller,
> enabling userspace BPF programs to implement custom, dynamic memory
> management policies per cgroup. The feature allows BPF programs to hook
> into the core reclaim and charge paths without requiring kernel
> modifications, providing a flexible alternative to static knobs such as
> memory.low and memory.min.
>
> The series enables two complementary use cases.
>
> Dynamic memory protection: static memory protection thresholds
> (memory.low, memory.min) are poor fits for workloads whose actual memory
> activity varies over time. A high-priority cgroup holding a large working
> set but temporarily idle will still suppress reclaim on its siblings,
> wasting available memory. A BPF-driven approach can observe real workload
> activity -- page faults, charge/uncharge events -- and activate or
> withdraw protection dynamically. The test results at the end of this
> letter quantify the difference: in a scenario where the high-priority
> cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> 37x higher throughput than with static memory.low.
>
> Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> hooks, combined with the BPF workqueue mechanism and the new
> bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> proactive background reclaim without blocking the charge path. The
> pattern works as follows: the memcg_charged callback tracks accumulated
> memory usage; when usage crosses a configurable threshold, it enqueues an
> asynchronous work item via bpf_wq_start() and returns immediately without
> throttling the charging task. The workqueue callback then invokes
> bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> itself to continue. This allows a BPF program to keep a cgroup's
> footprint below its hard limit (memory.max) entirely in the background,
> avoiding the OOM killer or direct-reclaim stalls that would otherwise
> occur. The selftest for this feature (patch 10/11) validates the
> mechanism concretely: a workload that writes and mmaps a 64 MB file inside
> a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
> with the async reclaim program attached, the "max" counter does not
> increase at all across the same workload.
>
Hi Hui,

Thanks for the series.

Would it not be simpler to just add another memcg knob, something like
memory.high_async?

When memory usage exceeds memory.high_async, queue a per-memcg work item that
calls try_to_free_mem_cgroup_pages() until usage drops back below some
threshold.

I am not sure I see which programmability aspect of BPF you need here.

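To make the suggestion concrete, here is a rough userspace model of the loop I have in mind. It is only a sketch: memory.high_async does not exist today, struct memcg_model and the model_*() helpers are made up for illustration, and model_reclaim() merely stands in for try_to_free_mem_cgroup_pages(). None of this is kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of one memcg; not a kernel structure. */
struct memcg_model {
    size_t usage;       /* pages currently charged */
    size_t high_async;  /* assumed memory.high_async threshold, in pages */
    bool work_pending;  /* true while a reclaim work item is queued */
};

/* Charge path: account the pages and queue work if over the threshold.
 * It never reclaims or sleeps itself. */
static void model_charge(struct memcg_model *cg, size_t nr_pages)
{
    cg->usage += nr_pages;
    if (cg->usage > cg->high_async)
        cg->work_pending = true;
}

/* Stand-in for try_to_free_mem_cgroup_pages(): frees at most 'batch' pages
 * and returns how many were freed. */
static size_t model_reclaim(struct memcg_model *cg, size_t batch)
{
    size_t freed = cg->usage < batch ? cg->usage : batch;

    cg->usage -= freed;
    return freed;
}

/* Work item: reclaim in batches until usage is back at or below the
 * threshold, or nothing more can be freed. Returns the number of passes. */
static unsigned int model_async_work(struct memcg_model *cg, size_t batch)
{
    unsigned int passes = 0;

    while (cg->usage > cg->high_async) {
        if (model_reclaim(cg, batch) == 0)
            break; /* no progress, give up instead of spinning */
        passes++;
    }
    cg->work_pending = false;
    return passes;
}
```

With a threshold of 100 pages, charging 90 and then 50 pages marks the work as pending at 140 pages, and a worker using 16-page batches brings usage down to 92 pages in three passes. The charging task never waits in this model, which is the same property the BPF workqueue version provides.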
Thanks
>
> 08/11 selftests/bpf: Add tests for memcg_bpf_ops
> Adds prog_tests/memcg_ops.c covering three scenarios:
> memcg_charged-only throttling, below_low + memcg_charged
> interaction, and below_min + memcg_charged interaction. A
> tracepoint on memcg:count_memcg_events (PGFAULT) is used to
> detect memory pressure and trigger hooks accordingly.
>
> 09/11 selftests/bpf: Add test for memcg_bpf_ops hierarchies
> Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
> three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
> root, override at the middle level without the flag, then assert
> that attaching to the leaf correctly fails with -EBUSY.
>
> 10/11 selftests/bpf: Add selftest for memcg async reclaim via BPF
> Demonstrates and validates asynchronous memory reclaim: a BPF
> program uses the memcg_charged/memcg_uncharged hooks to track
> accumulated usage and, when a threshold is exceeded, enqueues a
> bpf_wq_start() workqueue item that calls
> bpf_try_to_free_mem_cgroup_pages() without blocking the charge
> path. The test asserts that with the BPF program active,
> memory.events "max" events do not increase under a workload
> that would otherwise exceed the hard limit.
>
> 11/11 samples/bpf: Add memcg priority control and async reclaim example
> Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
> demonstrating both features. The BPF side monitors PGFAULT
> events on a high-priority cgroup; when the per-second fault
> count crosses a configurable threshold, it activates below_low
> or below_min protection for the high-priority cgroup and/or
> applies a charge delay to the low-priority cgroup. Six
> struct_ops variants are exported so userspace can attach only
> the hooks needed. Async reclaim is optionally combined with
> priority throttling via a shared low-cgroup ops map.
>
> Test Environment:
> The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using
> a tmpfs-backed file on the host as a swap device to reduce I/O impact.
> Two cgroups are created -- high (high-priority) and low (low-priority)
> -- and each test runs two concurrent stress-ng workloads, one per
> cgroup, each requesting 3 GB of memory.
>
> # mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
> # free -h
> total used free shared buff/cache available
> Mem: 1.9Gi 317Mi 1.6Gi 1.0Mi 144Mi 1.6Gi
> Swap: 4.0Gi 0B 4.0Gi
>
> Baseline: no memory priority policy:
> Both cgroups run without any reclaim protection. Results are roughly
> equal, as expected:
>
> cgroup bogo ops/s
> high 4,979
> low 4,927
>
> Test 1: memory.low protection:
> Setting memory.low on the high-priority cgroup protects it from
> reclaim, at the cost of pushing reclaim pressure onto the low-priority
> cgroup:
>
> # echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
>
> cgroup bogo ops/s
> high 450,290
> low 11,307
>
> The high-priority cgroup benefits significantly, but memory.low relies
> on static usage thresholds and cannot adapt to actual workload
> behavior.
>
> Test 2: memory.low with an idle high-priority task:
> Here the high-priority cgroup runs a Python script that allocates 3 GB
> and then sleeps, simulating a low-activity but memory-holding workload.
> Because the process is idle, it generates no page faults and does not
> actively use its memory. Yet memory.low still protects it, continuing
> to suppress the low-priority cgroup's performance:
>
> cgroup bogo ops/s
> low 14,757
>
> The low-priority cgroup remains significantly throttled despite the
> high-priority cgroup being effectively idle -- a clear limitation of
> static memory.low control.
>
> Test 3: memcg eBPF -- dynamic priority control:
> memcg is a sample program introduced in this patch series
> (samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
> monitors PGFAULT events in the high-priority cgroup. When the
> per-second fault count exceeds a configured threshold, the hook
> activates below_min protection for one second; otherwise the cgroup
> receives no special treatment.
>
> # ./memcg --low_path=/sys/fs/cgroup/low \
> --high_path=/sys/fs/cgroup/high \
> --threshold=1 --use_below_min
> Successfully attached!
>
> 3a. Both cgroups under active memory pressure:
>
> When both cgroups run stress-ng, the high-priority cgroup generates
> frequent page faults and the BPF hook activates protection, matching
> the behavior of memory.low:
>
> cgroup bogo ops/s
> high 404,392
> low 11,404
>
> 3b. High-priority cgroup is idle (Python + sleep):
>
> Because the sleeping Python process generates no page faults, the BPF
> hook never activates, and the low-priority cgroup is free to reclaim
> memory normally:
>
> cgroup bogo ops/s
> low 551,083
>
> This is a ~37x improvement over the equivalent memory.low scenario
> (Test 2), demonstrating that eBPF-driven dynamic control can
> accurately reflect actual workload activity and avoid unnecessary
> protection of idle high-priority tasks.
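As a quick sanity check of the ~37x figure, the ratio can be recomputed from the two low-cgroup readings quoted above (Test 2 and Test 3b). This is a minimal sketch; the helper names throughput_ratio() and print_idle_comparison() are illustrative and not part of the series.

```c
#include <stdio.h>

/* Ratio between two throughput readings in bogo ops/s. */
static double throughput_ratio(double candidate, double reference)
{
    return candidate / reference;
}

/* Print the low-cgroup comparison for the idle high-priority scenario. */
static void print_idle_comparison(void)
{
    const double memory_low_idle = 14757.0; /* Test 2: static memory.low */
    const double bpf_idle = 551083.0;       /* Test 3b: memcg eBPF */

    printf("idle speedup: %.1fx\n",
           throughput_ratio(bpf_idle, memory_low_idle));
}
```

The ratio works out to a little over 37.3, which matches the rounded number in the cover letter.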
>
> Summary:
> Scenario low-cgroup bogo ops/s
> Baseline (no policy) ~4,927
> memory.low, both active ~11,307
> memory.low, high idle ~14,757
> memcg eBPF, both active ~11,404
> memcg eBPF, high idle ~551,083
>
> References:
> [1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/
>
> Changelog:
> v7:
> Change base commits of "mm: BPF OOM" to v3.
> Some fixes according to the comments of bpf-ci.
> Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged
> hook for tracking uncharge events.
> Update below_low and below_min hooks to receive elow/emin and usage
> as explicit arguments.
> Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim
> to BPF programs.
> Add selftest for BPF-driven asynchronous page reclaim.
> Extend samples/bpf/memcg to support async reclaim in addition to
> priority throttling.
> v6:
> Based on the bot+bpf-ci comments, fixed the following issues.
> Added fast-path check with unlikely() before SRCU lock acquisition to
> optimize the no-BPF case in BPF_MEMCG_CALL.
> Added the missing newline to the pr_warn message in bpf_memcontrol_init.
> Added comprehensive child process exit status checking with WIFEXITED()
> and WEXITSTATUS(), and added zombie process prevention in
> real_test_memcg_ops.
> Changed malloc() to calloc() for BSS data allocation in all test
> functions and samples main function.
> Change srcu_read_lock(&memcg_bpf_srcu) to
> lockdep_assert_held(&cgroup_mutex) in function memcontrol_bpf_online
> and memcontrol_bpf_offline.
> v5:
> Based on the bot+bpf-ci comments, fixed the following issues.
> Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
> declaration to the beginning of need_threshold() function.
> The 'u64 current_ts' variable must be declared before any
> executable statements.
> Improved input validation in samples/bpf/memcg.c by adding a new
> parse_u64() helper function. This function properly handles errors
> from strtoull() and provides better error messages when parsing
> threshold and over_high_ms command-line arguments.
> Move check for prog->sleepable after validating member offsets in
> mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
> Fixed sscanf return value checking in prog_tests/memcg_ops.c.
> Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
> sscanf returns the number of successfully matched items, so a matching
> failure yields 0 rather than a negative value. This makes the test more
> reliable when reading timing data from temporary files.
> v4:
> Fix the issues according to the comments from bot+bpf-ci.
> According to JP Kobryn's comments, move exit(0) from
> real_test_memcg_ops_child_work to real_test_memcg_ops.
> Fix issues in the bpf_memcg_ops_reg function.
> v3:
> According to the comments from Michal Koutný and Chen Ridong, update hooks
> to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
> handle_cgroup_offline.
> According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
> support to memcg_bpf_ops.
> v2:
> According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
> OOM patch series [1] and added hierarchical delegation support.
> According to the comments from Roman Gushchin and Michal Hocko, designed
> concrete use case scenarios and provided test results.
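The sscanf() check mentioned in the v5 changelog is easy to get wrong, so here is a small standalone illustration of the difference between the two conditions. It is a sketch only: parse_int_checked() and parse_int_lenient() are hypothetical names and do not appear in the selftest.

```c
#include <stdio.h>

/* Requires exactly one converted item; returns 0 on success, -1 otherwise. */
static int parse_int_checked(const char *text, int *out)
{
    /* sscanf() returns the number of items matched and assigned. */
    if (sscanf(text, "%d", out) != 1)
        return -1;
    return 0;
}

/* The older style of check: only a negative return (EOF) counts as an error,
 * so a matching failure, which returns 0, slips through. */
static int parse_int_lenient(const char *text, int *out)
{
    if (sscanf(text, "%d", out) < 0)
        return -1;
    return 0;
}
```

For the input "abc", sscanf() reports zero matched items, so parse_int_checked() fails while parse_int_lenient() reports success and leaves *out untouched. Only an empty input, where sscanf() returns EOF, is caught by both.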
>
> Hui Zhu (7):
> bpf: Pass flags in bpf_link_create for struct_ops
> mm: memcontrol: Add BPF struct_ops for memory controller
> mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
> selftests/bpf: Add tests for memcg_bpf_ops
> selftests/bpf: Add test for memcg_bpf_ops hierarchies
> selftests/bpf: Add selftest for memcg async reclaim via BPF
> samples/bpf: Add memcg priority control and async reclaim example
>
> Roman Gushchin (4):
> bpf: move bpf_struct_ops_link into bpf.h
> bpf: allow attaching struct_ops to cgroups
> libbpf: fix return value on memory allocation failure
> libbpf: introduce bpf_map__attach_struct_ops_opts()
>
> MAINTAINERS | 6 +
> include/linux/bpf-cgroup-defs.h | 3 +
> include/linux/bpf-cgroup.h | 16 +
> include/linux/bpf.h | 10 +
> include/linux/memcontrol.h | 250 ++++++-
> include/uapi/linux/bpf.h | 5 +-
> kernel/bpf/bpf_struct_ops.c | 67 +-
> kernel/bpf/cgroup.c | 46 ++
> mm/bpf_memcontrol.c | 355 +++++++++-
> mm/memcontrol.c | 43 +-
> samples/bpf/.gitignore | 1 +
> samples/bpf/Makefile | 8 +-
> samples/bpf/memcg.bpf.c | 380 +++++++++++
> samples/bpf/memcg.c | 411 ++++++++++++
> tools/include/uapi/linux/bpf.h | 3 +-
> tools/lib/bpf/libbpf.c | 22 +-
> tools/lib/bpf/libbpf.h | 14 +
> tools/lib/bpf/libbpf.map | 1 +
> tools/testing/selftests/bpf/cgroup_helpers.c | 41 ++
> tools/testing/selftests/bpf/cgroup_helpers.h | 2 +
> .../bpf/prog_tests/memcg_async_reclaim.c | 333 +++++++++
> .../selftests/bpf/prog_tests/memcg_ops.c | 634 ++++++++++++++++++
> .../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
> tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
> 24 files changed, 2952 insertions(+), 34 deletions(-)
> create mode 100644 samples/bpf/memcg.bpf.c
> create mode 100644 samples/bpf/memcg.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
> create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
> create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
>
> --
> 2.43.0
>
>