From: Usama Arif <usama.arif@linux.dev>
To: Hui Zhu <hui.zhu@linux.dev>
Cc: Usama Arif <usama.arif@linux.dev>,
	Daniel Borkmann <daniel@iogearbox.net>,
	John Fastabend <john.fastabend@gmail.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	Jiri Olsa <jolsa@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	JP Kobryn <inwardvessel@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shuah Khan <shuah@kernel.org>,
	davem@davemloft.net, Jakub Kicinski <kuba@kernel.org>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Stanislav Fomichev <sdf@fomichev.me>,
	KP Singh <kpsingh@kernel.org>, Tao Chen <chen.dylane@linux.dev>,
	Mykyta Yatsenko <yatsenko@meta.com>,
	Leon Hwang <leon.hwang@linux.dev>,
	Anton Protopopov <a.s.protopopov@gmail.com>,
	Amery Hung <ameryhung@gmail.com>,
	Tobias Klauser <tklauser@distanz.ch>,
	Eyal Birger <eyal.birger@gmail.com>, Rong Tao <rongtao@cestc.cn>,
	Hao Luo <haoluo@google.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Miguel Ojeda <ojeda@kernel.org>,
	Nathan Chancellor <nathan@kernel.org>,
	Kees Cook <kees@kernel.org>, Tejun Heo <tj@kernel.org>,
	Jeff Xu <jeffxu@chromium.org>,
	mkoutny@suse.com, Jan Hendrik Farr <kernel@jfarr.cc>,
	Christian Brauner <brauner@kernel.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	Brian Gerst <brgerst@gmail.com>,
	Masahiro Yamada <masahiroy@kernel.org>,
	Willem de Bruijn <willemb@google.com>,
	Jason Xing <kerneljasonxing@gmail.com>,
	Paul Chaignon <paul.chaignon@gmail.com>,
	Chen Ridong <chenridong@huaweicloud.com>,
	Lance Yang <lance.yang@linux.dev>,
	Jiayuan Chen <jiayuan.chen@linux.dev>,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	netdev@vger.kernel.org, linux-kselftest@vger.kernel.org,
	geliang@kernel.org, baohua@kernel.org,
	Hui Zhu <zhuhui@kylinos.cn>
Subject: Re: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim
Date: Tue, 26 May 2026 06:41:02 -0700
Message-ID: <20260526134115.816081-1-usama.arif@linux.dev>
In-Reply-To: <cover.1779760876.git.zhuhui@kylinos.cn>

On Tue, 26 May 2026 10:20:00 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:

> From: Hui Zhu <zhuhui@kylinos.cn>
> 
> Overview:
> This series introduces BPF struct_ops support for the memory controller,
> enabling userspace BPF programs to implement custom, dynamic memory
> management policies per cgroup. The feature allows BPF programs to hook
> into the core reclaim and charge paths without requiring kernel
> modifications, providing a flexible alternative to static knobs such as
> memory.low and memory.min.
>  
> The series enables two complementary use cases.
>  
> Dynamic memory protection: static memory protection thresholds
> (memory.low, memory.min) are poor fits for workloads whose actual memory
> activity varies over time. A high-priority cgroup holding a large working
> set but temporarily idle will still suppress reclaim on its siblings,
> wasting available memory. A BPF-driven approach can observe real workload
> activity -- page faults, charge/uncharge events -- and activate or
> withdraw protection dynamically. The test results at the end of this
> letter quantify the difference: in a scenario where the high-priority
> cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> 37x higher throughput than with static memory.low.
>  
> Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> hooks, combined with the BPF workqueue mechanism and the new
> bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> proactive background reclaim without blocking the charge path. The
> pattern works as follows: the memcg_charged callback tracks accumulated
> memory usage; when usage crosses a configurable threshold, it enqueues an
> asynchronous work item via bpf_wq_start() and returns immediately without
> throttling the charging task. The workqueue callback then invokes
> bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> itself to continue. This allows a BPF program to keep a cgroup's
> footprint below its hard limit (memory.max) entirely in the background,
> avoiding the OOM killer or direct-reclaim stalls that would otherwise
> occur. The selftest for this feature (patch 10/11) validates the
> mechanism concretely: a workload that writes and mmaps a 64 MB file inside
> a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
> with the async reclaim program attached, the "max" counter does not
> increase at all across the same workload.
>  


Hi Hui,

Thanks for the series.

Would it not be simpler to add another memcg knob, something like
memory.high_async? When memory usage exceeds memory.high_async, the
kernel would queue a per-memcg work item that calls
try_to_free_mem_cgroup_pages() until usage drops back below some
threshold.

I am not sure I see which programmability aspect of BPF you need here.
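
To make the suggestion concrete, below is a rough userspace model of
the control flow I have in mind. It is only a sketch, not kernel code:
struct memcg_model, model_charge(), model_reclaim(), model_async_work()
and the low_mark field are names made up for illustration, and
model_reclaim() merely stands in for try_to_free_mem_cgroup_pages().
memory.high_async itself is a hypothetical knob, not an existing
interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a memcg; not struct mem_cgroup. */
struct memcg_model {
	size_t usage;         /* pages currently charged */
	size_t high_async;    /* assumed memory.high_async threshold, in pages */
	size_t low_mark;      /* assumed stop point for background reclaim */
	bool work_pending;    /* models a queued per-memcg work item */
	unsigned long queued; /* how many times the work item was queued */
};

/* Stand-in for try_to_free_mem_cgroup_pages(): frees up to nr pages. */
static size_t model_reclaim(struct memcg_model *m, size_t nr)
{
	size_t freed = nr < m->usage ? nr : m->usage;

	m->usage -= freed;
	return freed;
}

/* Charge path: account the pages, queue work if needed, never block. */
static void model_charge(struct memcg_model *m, size_t nr_pages)
{
	m->usage += nr_pages;
	if (m->usage > m->high_async && !m->work_pending) {
		m->work_pending = true;
		m->queued++;
	}
}

/*
 * Work item body: reclaim in batches until usage is back at or below
 * low_mark, or until reclaim stops making progress. Returns the total
 * number of pages freed.
 */
static size_t model_async_work(struct memcg_model *m, size_t batch)
{
	size_t total = 0;

	assert(batch > 0);
	if (!m->work_pending)
		return 0;

	while (m->usage > m->low_mark) {
		size_t want = m->usage - m->low_mark;
		size_t freed = model_reclaim(m, want < batch ? want : batch);

		if (freed == 0)
			break;
		total += freed;
	}
	m->work_pending = false;
	return total;
}
```

The charging task only pays for a comparison and a flag update; all
reclaim happens in the work item. In this model the only policy inputs
are the two thresholds and the batch size, which is why a plain knob
looks sufficient to me.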

Thanks

>  
>   08/11  selftests/bpf: Add tests for memcg_bpf_ops
>          Adds prog_tests/memcg_ops.c covering three scenarios:
>          memcg_charged-only throttling, below_low + memcg_charged
>          interaction, and below_min + memcg_charged interaction. A
>          tracepoint on memcg:count_memcg_events (PGFAULT) is used to
>          detect memory pressure and trigger hooks accordingly.
>  
>   09/11  selftests/bpf: Add test for memcg_bpf_ops hierarchies
>          Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
>          three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
>          root, override at the middle level without the flag, then assert
>          that attaching to the leaf correctly fails with -EBUSY.
>  
>   10/11  selftests/bpf: Add selftest for memcg async reclaim via BPF
>          Demonstrates and validates asynchronous memory reclaim: a BPF
>          program uses the memcg_charged/memcg_uncharged hooks to track
>          accumulated usage and, when a threshold is exceeded, enqueues a
>          bpf_wq_start() workqueue item that calls
>          bpf_try_to_free_mem_cgroup_pages() without blocking the charge
>          path. The test asserts that with the BPF program active,
>          memory.events "max" events do not increase under a workload
>          that would otherwise exceed the hard limit.
>  
>   11/11  samples/bpf: Add memcg priority control and async reclaim example
>          Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
>          demonstrating both features. The BPF side monitors PGFAULT
>          events on a high-priority cgroup; when the per-second fault
>          count crosses a configurable threshold, it activates below_low
>          or below_min protection for the high-priority cgroup and/or
>          applies a charge delay to the low-priority cgroup. Six
>          struct_ops variants are exported so userspace can attach only
>          the hooks needed. Async reclaim is optionally combined with
>          priority throttling via a shared low-cgroup ops map.
>  
> Test Environment:
> The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using
> a tmpfs-backed file on the host as a swap device to reduce I/O impact.
> Two cgroups are created -- high (high-priority) and low (low-priority)
> -- and each test runs two concurrent stress-ng workloads, one per
> cgroup, each requesting 3 GB of memory.
>  
>   # mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
>   # free -h
>                  total   used    free  shared  buff/cache  available
>   Mem:           1.9Gi  317Mi  1.6Gi   1.0Mi       144Mi      1.6Gi
>   Swap:          4.0Gi     0B  4.0Gi
>  
> Baseline: no memory priority policy:
> Both cgroups run without any reclaim protection. Results are roughly
> equal, as expected:
>  
>   cgroup    bogo ops/s
>   high           4,979
>   low            4,927
>  
> Test 1: memory.low protection:
> Setting memory.low on the high-priority cgroup protects it from
> reclaim, at the cost of pushing reclaim pressure onto the low-priority
> cgroup:
>  
>   # echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
>  
>   cgroup    bogo ops/s
>   high         450,290
>   low           11,307
>  
> The high-priority cgroup benefits significantly, but memory.low relies
> on static usage thresholds and cannot adapt to actual workload
> behavior.
>  
> Test 2: memory.low with an idle high-priority task:
> Here the high-priority cgroup runs a Python script that allocates 3 GB
> and then sleeps, simulating a low-activity but memory-holding workload.
> Because the process is idle, it generates no page faults and does not
> actively use its memory. Yet memory.low still protects it, continuing
> to suppress the low-priority cgroup's performance:
>  
>   cgroup    bogo ops/s
>   low           14,757
>  
> The low-priority cgroup remains significantly throttled despite the
> high-priority cgroup being effectively idle -- a clear limitation of
> static memory.low control.
>  
> Test 3: memcg eBPF -- dynamic priority control:
> memcg is a sample program introduced in this patch series
> (samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
> monitors PGFAULT events in the high-priority cgroup. When the
> per-second fault count exceeds a configured threshold, the hook
> activates below_min protection for one second; otherwise the cgroup
> receives no special treatment.
>  
>   # ./memcg --low_path=/sys/fs/cgroup/low  \
>             --high_path=/sys/fs/cgroup/high \
>             --threshold=1 --use_below_min
>   Successfully attached!
>  
> 3a. Both cgroups under active memory pressure:
>  
> When both cgroups run stress-ng, the high-priority cgroup generates
> frequent page faults and the BPF hook activates protection, matching
> the behavior of memory.low:
>  
>   cgroup    bogo ops/s
>   high         404,392
>   low           11,404
>  
> 3b. High-priority cgroup is idle (Python + sleep):
>  
> Because the sleeping Python process generates no page faults, the BPF
> hook never activates, and the low-priority cgroup is free to reclaim
> memory normally:
>  
>   cgroup    bogo ops/s
>   low          551,083
>  
> This is a ~37x improvement over the equivalent memory.low scenario
> (Test 2), demonstrating that eBPF-driven dynamic control can
> accurately reflect actual workload activity and avoid unnecessary
> protection of idle high-priority tasks.
>  
> Summary:
>   Scenario                          low-cgroup bogo ops/s
>   Baseline (no policy)                           ~4,927
>   memory.low, both active                       ~11,307
>   memory.low, high idle                         ~14,757
>   memcg eBPF, both active                       ~11,404
>   memcg eBPF, high idle                        ~551,083
>  
> References:
> [1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/
> 
> Changelog:
> v7:
> Change base commits of "mm: BPF OOM" to v3.
> Some fixes according to the comments of bpf-ci.
> Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged
> hook for tracking uncharge events.
> Update below_low and below_min hooks to receive elow/emin and usage
> as explicit arguments.
> Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim
> to BPF programs.
> Add selftest for BPF-driven asynchronous page reclaim.
> Extend samples/bpf/memcg to support async reclaim in addition to
> priority throttling.
> v6:
> Based on the bot+bpf-ci comments, fixed the following issues.
> Added fast-path check with unlikely() before SRCU lock acquisition to
> optimize the no-BPF case in BPF_MEMCG_CALL.
> Add missing newline in pr_warn message in bpf_memcontrol_init.
> Added comprehensive child process exit status checking with WIFEXITED()
> and WEXITSTATUS(), and added zombie process prevention in
> real_test_memcg_ops.
> Changed malloc() to calloc() for BSS data allocation in all test
> functions and samples main function.
> Change srcu_read_lock(&memcg_bpf_srcu) to
> lockdep_assert_held(&cgroup_mutex) in function memcontrol_bpf_online
> and memcontrol_bpf_offline.
> v5:
> Based on the bot+bpf-ci comments, fixed the following issues.
> Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
> declaration to the beginning of need_threshold() function.
> The 'u64 current_ts' variable must be declared before any
> executable statements.
> Improved input validation in samples/bpf/memcg.c by adding a new
> parse_u64() helper function. This function properly handles errors
> from strtoull() and provides better error messages when parsing
> threshold and over_high_ms command-line arguments.
> Move check for prog->sleepable after validating member offsets in
> mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
> Fixed sscanf return value checking in prog_tests/memcg_ops.c.
> Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
> sscanf returns the number of successfully matched items, not a negative
> value on error. This makes the test more reliable when reading timing
> data from temporary files.
> v4:
> Fix the issues according to the comments from bot+bpf-ci.
> According to JP Kobryn's comments, move exit(0) from
> real_test_memcg_ops_child_work to real_test_memcg_ops.
> Fix issues in the bpf_memcg_ops_reg function.
> v3:
> According to the comments from Michal Koutný and Chen Ridong, update hooks
> to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
> handle_cgroup_offline.
> According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
> support to memcg_bpf_ops.
> v2:
> According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
> OOM patch series [1] and added hierarchical delegation support.
> According to the comments from Roman Gushchin and Michal Hocko, designed
> concrete use case scenarios and provided test results.
> 
> Hui Zhu (7):
>   bpf: Pass flags in bpf_link_create for struct_ops
>   mm: memcontrol: Add BPF struct_ops for memory controller
>   mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
>   selftests/bpf: Add tests for memcg_bpf_ops
>   selftests/bpf: Add test for memcg_bpf_ops hierarchies
>   selftests/bpf: Add selftest for memcg async reclaim via BPF
>   samples/bpf: Add memcg priority control and async reclaim example
> 
> Roman Gushchin (4):
>   bpf: move bpf_struct_ops_link into bpf.h
>   bpf: allow attaching struct_ops to cgroups
>   libbpf: fix return value on memory allocation failure
>   libbpf: introduce bpf_map__attach_struct_ops_opts()
> 
>  MAINTAINERS                                   |   6 +
>  include/linux/bpf-cgroup-defs.h               |   3 +
>  include/linux/bpf-cgroup.h                    |  16 +
>  include/linux/bpf.h                           |  10 +
>  include/linux/memcontrol.h                    | 250 ++++++-
>  include/uapi/linux/bpf.h                      |   5 +-
>  kernel/bpf/bpf_struct_ops.c                   |  67 +-
>  kernel/bpf/cgroup.c                           |  46 ++
>  mm/bpf_memcontrol.c                           | 355 +++++++++-
>  mm/memcontrol.c                               |  43 +-
>  samples/bpf/.gitignore                        |   1 +
>  samples/bpf/Makefile                          |   8 +-
>  samples/bpf/memcg.bpf.c                       | 380 +++++++++++
>  samples/bpf/memcg.c                           | 411 ++++++++++++
>  tools/include/uapi/linux/bpf.h                |   3 +-
>  tools/lib/bpf/libbpf.c                        |  22 +-
>  tools/lib/bpf/libbpf.h                        |  14 +
>  tools/lib/bpf/libbpf.map                      |   1 +
>  tools/testing/selftests/bpf/cgroup_helpers.c  |  41 ++
>  tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
>  .../bpf/prog_tests/memcg_async_reclaim.c      | 333 +++++++++
>  .../selftests/bpf/prog_tests/memcg_ops.c      | 634 ++++++++++++++++++
>  .../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
>  tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
>  24 files changed, 2952 insertions(+), 34 deletions(-)
>  create mode 100644 samples/bpf/memcg.bpf.c
>  create mode 100644 samples/bpf/memcg.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
> 
> -- 
> 2.43.0
> 
> 