From: Usama Arif
To: Hui Zhu
Cc: Usama Arif, Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu,
	Yonghong Song, Jiri Olsa, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, JP Kobryn, Andrew Morton,
	Shuah Khan, davem@davemloft.net, Jakub Kicinski,
	Jesper Dangaard Brouer, Stanislav Fomichev, KP Singh, Tao Chen,
	Mykyta Yatsenko, Leon Hwang, Anton Protopopov, Amery Hung,
	Tobias Klauser, Eyal Birger, Rong Tao, Hao Luo, Peter Zijlstra,
	Miguel Ojeda, Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu,
	mkoutny@suse.com, Jan Hendrik Farr, Christian Brauner, Randy Dunlap,
	Brian Gerst, Masahiro Yamada, Willem de Bruijn, Jason Xing,
	Paul Chaignon, Chen Ridong, Lance Yang, Jiayuan Chen,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org, netdev@vger.kernel.org,
	linux-kselftest@vger.kernel.org, geliang@kernel.org,
	baohua@kernel.org, Hui Zhu
Subject: Re: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim
Date: Tue, 26 May 2026 06:41:02 -0700
Message-ID: <20260526134115.816081-1-usama.arif@linux.dev>

On Tue, 26 May 2026 10:20:00 +0800 Hui Zhu wrote:

> From: Hui Zhu
>
> Overview:
> This series introduces BPF struct_ops support for the memory controller,
> enabling userspace BPF programs to implement custom, dynamic memory
> management policies per cgroup.
> The feature allows BPF programs to hook
> into the core reclaim and charge paths without requiring kernel
> modifications, providing a flexible alternative to static knobs such as
> memory.low and memory.min.
>
> The series enables two complementary use cases.
>
> Dynamic memory protection: static memory protection thresholds
> (memory.low, memory.min) are poor fits for workloads whose actual memory
> activity varies over time. A high-priority cgroup holding a large working
> set but temporarily idle will still suppress reclaim on its siblings,
> wasting available memory. A BPF-driven approach can observe real workload
> activity -- page faults, charge/uncharge events -- and activate or
> withdraw protection dynamically. The test results at the end of this
> letter quantify the difference: in a scenario where the high-priority
> cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> 37x higher throughput than with static memory.low.
>
> Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> hooks, combined with the BPF workqueue mechanism and the new
> bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> proactive background reclaim without blocking the charge path. The
> pattern works as follows: the memcg_charged callback tracks accumulated
> memory usage; when usage crosses a configurable threshold, it enqueues an
> asynchronous work item via bpf_wq_start() and returns immediately without
> throttling the charging task. The workqueue callback then invokes
> bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> itself to continue. This allows a BPF program to keep a cgroup's
> footprint below its hard limit (memory.max) entirely in the background,
> avoiding the OOM killer or direct-reclaim stalls that would otherwise
> occur.
> The selftest for this feature (patch 10/11) validates the
> mechanism concretely: a workload that writes and mmaps a 64 MB file inside
> a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
> with the async reclaim program attached, the "max" counter does not
> increase at all across the same workload.
>

Hi Hui,

Thanks for the series.

Would it not be simpler to just have another memcg knob, something like
memory.high_async? When memory usage > memory.high_async, queue a per-memcg
work item that calls try_to_free_mem_cgroup_pages() until usage drops back
below some threshold.

I am not sure I see which programmability aspect of BPF you need here.

Thanks

>
> 08/11 selftests/bpf: Add tests for memcg_bpf_ops
>   Adds prog_tests/memcg_ops.c covering three scenarios:
>   memcg_charged-only throttling, below_low + memcg_charged
>   interaction, and below_min + memcg_charged interaction. A
>   tracepoint on memcg:count_memcg_events (PGFAULT) is used to
>   detect memory pressure and trigger hooks accordingly.
>
> 09/11 selftests/bpf: Add test for memcg_bpf_ops hierarchies
>   Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
>   three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
>   root, override at the middle level without the flag, then assert
>   that attaching to the leaf correctly fails with -EBUSY.
>
> 10/11 selftests/bpf: Add selftest for memcg async reclaim via BPF
>   Demonstrates and validates asynchronous memory reclaim: a BPF
>   program uses the memcg_charged/memcg_uncharged hooks to track
>   accumulated usage and, when a threshold is exceeded, enqueues a
>   bpf_wq_start() workqueue item that calls
>   bpf_try_to_free_mem_cgroup_pages() without blocking the charge
>   path. The test asserts that with the BPF program active,
>   memory.events "max" events do not increase under a workload
>   that would otherwise exceed the hard limit.
>
> 11/11 samples/bpf: Add memcg priority control and async reclaim example
>   Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
>   demonstrating both features. The BPF side monitors PGFAULT
>   events on a high-priority cgroup; when the per-second fault
>   count crosses a configurable threshold, it activates below_low
>   or below_min protection for the high-priority cgroup and/or
>   applies a charge delay to the low-priority cgroup. Six
>   struct_ops variants are exported so userspace can attach only
>   the hooks needed. Async reclaim is optionally combined with
>   priority throttling via a shared low-cgroup ops map.
>
> Test Environment:
> The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using
> a tmpfs-backed file on the host as a swap device to reduce I/O impact.
> Two cgroups are created -- high (high-priority) and low (low-priority)
> -- and each test runs two concurrent stress-ng workloads, one per
> cgroup, each requesting 3 GB of memory.
>
>   # mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
>   # free -h
>           total   used    free    shared  buff/cache  available
>   Mem:    1.9Gi   317Mi   1.6Gi   1.0Mi   144Mi       1.6Gi
>   Swap:   4.0Gi   0B      4.0Gi
>
> Baseline: no memory priority policy:
> Both cgroups run without any reclaim protection. Results are roughly
> equal, as expected:
>
>   cgroup    bogo ops/s
>   high      4,979
>   low       4,927
>
> Test 1: memory.low protection:
> Setting memory.low on the high-priority cgroup protects it from
> reclaim, at the cost of pushing reclaim pressure onto the low-priority
> cgroup:
>
>   # echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
>
>   cgroup    bogo ops/s
>   high      450,290
>   low       11,307
>
> The high-priority cgroup benefits significantly, but memory.low relies
> on static usage thresholds and cannot adapt to actual workload
> behavior.
>
> Test 2: memory.low with an idle high-priority task:
> Here the high-priority cgroup runs a Python script that allocates 3 GB
> and then sleeps, simulating a low-activity but memory-holding workload.
> Because the process is idle, it generates no page faults and does not
> actively use its memory. Yet memory.low still protects it, continuing
> to suppress the low-priority cgroup's performance:
>
>   cgroup    bogo ops/s
>   low       14,757
>
> The low-priority cgroup remains significantly throttled despite the
> high-priority cgroup being effectively idle -- a clear limitation of
> static memory.low control.
>
> Test 3: memcg eBPF -- dynamic priority control:
> memcg is a sample program introduced in this patch series
> (samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
> monitors PGFAULT events in the high-priority cgroup. When the
> per-second fault count exceeds a configured threshold, the hook
> activates below_min protection for one second; otherwise the cgroup
> receives no special treatment.
>
>   # ./memcg --low_path=/sys/fs/cgroup/low \
>       --high_path=/sys/fs/cgroup/high \
>       --threshold=1 --use_below_min
>   Successfully attached!
>
> 3a. Both cgroups under active memory pressure:
>
> When both cgroups run stress-ng, the high-priority cgroup generates
> frequent page faults and the BPF hook activates protection, matching
> the behavior of memory.low:
>
>   cgroup    bogo ops/s
>   high      404,392
>   low       11,404
>
> 3b. High-priority cgroup is idle (Python + sleep):
>
> Because the sleeping Python process generates no page faults, the BPF
> hook never activates, and the low-priority cgroup is free to reclaim
> memory normally:
>
>   cgroup    bogo ops/s
>   low       551,083
>
> This is a ~37x improvement over the equivalent memory.low scenario
> (Test 2), demonstrating that eBPF-driven dynamic control can
> accurately reflect actual workload activity and avoid unnecessary
> protection of idle high-priority tasks.
>
> Summary:
>   Scenario                   low-cgroup bogo ops/s
>   Baseline (no policy)       ~4,927
>   memory.low, both active    ~11,307
>   memory.low, high idle      ~14,757
>   memcg eBPF, both active    ~11,404
>   memcg eBPF, high idle      ~551,083
>
> References:
> [1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/
>
> Changelog:
> v7:
>   Change base commits of "mm: BPF OOM" to v3.
>   Some fixes according to the comments of bpf-ci.
>   Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged
>   hook for tracking uncharge events.
>   Update below_low and below_min hooks to receive elow/emin and usage
>   as explicit arguments.
>   Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim
>   to BPF programs.
>   Add selftest for BPF-driven asynchronous page reclaim.
>   Extend samples/bpf/memcg to support async reclaim in addition to
>   priority throttling.
> v6:
>   Based on the bot+bof-ci comments, fixed the following issues.
>   Added fast-path check with unlikely() before SRCU lock acquisition to
>   optimize the no-BPF case in BPF_MEMCG_CALL.
>   Add missing newline in pr_warn message to bpf_memcontrol_init.
>   Added comprehensive child process exit status checking with WIFEXITED()
>   and WEXITSTATUS(), and added zombie process prevention in
>   real_test_memcg_ops.
>   Changed malloc() to calloc() for BSS data allocation in all test
>   functions and samples main function.
>   Change srcu_read_lock(&memcg_bpf_srcu) to
>   lockdep_assert_held(&cgroup_mutex) in function memcontrol_bpf_online
>   and memcontrol_bpf_offline.
> v5:
>   Based on the bot+bof-ci comments, fixed the following issues.
>   Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
>   declaration to the beginning of need_threshold() function.
>   The 'u64 current_ts' variable must be declared before any
>   executable statements
>   Improved input validation in samples/bpf/memcg.c by adding a new
>   parse_u64() helper function.
>   This function properly handles errors
>   from strtoull() and provides better error messages when parsing
>   threshold and over_high_ms command-line arguments.
>   Move check for prog->sleepable after validating member offsets in
>   mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
>   Fixed sscanf return value checking in prog_tests/memcg_ops.c.
>   Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
>   sscanf returns the number of successfully matched items, not a negative
>   value on error. This makes the test more reliable when reading timing
>   data from temporary files.
> v4:
>   Fix the issues according to the comments from bot+bof-ci.
>   According to JP Kobryn's comments, move exit(0) from
>   real_test_memcg_ops_child_work to real_test_memcg_ops.
>   Fix issues in the bpf_memcg_ops_reg function.
> v3:
>   According to the comments from Michal Koutný and Chen Ridong, update hooks
>   to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
>   handle_cgroup_offline.
>   According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
>   support to memcg_bpf_ops.
> v2:
>   According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
>   OOM patch series [1] and added hierarchical delegation support.
>   According to the comments from Roman Gushchin and Michal Hocko, designed
>   concrete use case scenarios and provided test results.
>
> Hui Zhu (7):
>   bpf: Pass flags in bpf_link_create for struct_ops
>   mm: memcontrol: Add BPF struct_ops for memory controller
>   mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
>   selftests/bpf: Add tests for memcg_bpf_ops
>   selftests/bpf: Add test for memcg_bpf_ops hierarchies
>   selftests/bpf: Add selftest for memcg async reclaim via BPF
>   samples/bpf: Add memcg priority control and async reclaim example
>
> Roman Gushchin (4):
>   bpf: move bpf_struct_ops_link into bpf.h
>   bpf: allow attaching struct_ops to cgroups
>   libbpf: fix return value on memory allocation failure
>   libbpf: introduce bpf_map__attach_struct_ops_opts()
>
>  MAINTAINERS                                   |   6 +
>  include/linux/bpf-cgroup-defs.h               |   3 +
>  include/linux/bpf-cgroup.h                    |  16 +
>  include/linux/bpf.h                           |  10 +
>  include/linux/memcontrol.h                    | 250 ++++++-
>  include/uapi/linux/bpf.h                      |   5 +-
>  kernel/bpf/bpf_struct_ops.c                   |  67 +-
>  kernel/bpf/cgroup.c                           |  46 ++
>  mm/bpf_memcontrol.c                           | 355 +++++++++-
>  mm/memcontrol.c                               |  43 +-
>  samples/bpf/.gitignore                        |   1 +
>  samples/bpf/Makefile                          |   8 +-
>  samples/bpf/memcg.bpf.c                       | 380 +++++++++++
>  samples/bpf/memcg.c                           | 411 ++++++++++++
>  tools/include/uapi/linux/bpf.h                |   3 +-
>  tools/lib/bpf/libbpf.c                        |  22 +-
>  tools/lib/bpf/libbpf.h                        |  14 +
>  tools/lib/bpf/libbpf.map                      |   1 +
>  tools/testing/selftests/bpf/cgroup_helpers.c  |  41 ++
>  tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
>  .../bpf/prog_tests/memcg_async_reclaim.c      | 333 +++++++++
>  .../selftests/bpf/prog_tests/memcg_ops.c      | 634 ++++++++++++++++++
>  .../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
>  tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
>  24 files changed, 2952 insertions(+), 34 deletions(-)
>  create mode 100644 samples/bpf/memcg.bpf.c
>  create mode 100644 samples/bpf/memcg.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
>
> --
> 2.43.0
>
>
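
For illustration only, the memory.high_async idea suggested in the reply
above could behave roughly like the userspace simulation below. This is a
sketch under stated assumptions, not kernel code and not part of the
series: memory.high_async is not an existing knob, and struct sim_memcg,
sim_charge(), sim_reclaim(), sim_async_work() and the low_mark field are
hypothetical names. sim_reclaim() merely stands in for
try_to_free_mem_cgroup_pages(), and a boolean flag stands in for a queued
per-memcg work item.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one cgroup; all sizes are in pages. */
struct sim_memcg {
	size_t usage;       /* pages currently charged */
	size_t high_async;  /* assumed memory.high_async threshold */
	size_t low_mark;    /* assumed level at which background reclaim stops */
	bool work_pending;  /* stands in for a queued per-memcg work item */
};

/* Stand-in for try_to_free_mem_cgroup_pages(): frees up to nr pages. */
static size_t sim_reclaim(struct sim_memcg *cg, size_t nr)
{
	size_t freed = cg->usage < nr ? cg->usage : nr;

	cg->usage -= freed;
	return freed;
}

/* Charge path: account the pages and, above the threshold, only queue work. */
static void sim_charge(struct sim_memcg *cg, size_t nr)
{
	cg->usage += nr;
	if (cg->usage > cg->high_async)
		cg->work_pending = true;  /* no reclaim in the charging context */
}

/*
 * Work item: reclaim in batches until usage is back at low_mark or no
 * progress is made. Returns the total number of pages freed.
 */
static size_t sim_async_work(struct sim_memcg *cg, size_t batch)
{
	size_t total = 0;

	assert(cg->low_mark <= cg->high_async);
	if (!cg->work_pending)
		return 0;

	while (cg->usage > cg->low_mark) {
		size_t want = cg->usage - cg->low_mark;
		size_t freed = sim_reclaim(cg, want < batch ? want : batch);

		if (freed == 0)
			break;  /* nothing reclaimable, give up for now */
		total += freed;
	}
	cg->work_pending = false;
	return total;
}
```

Using a stop level (low_mark) below the trigger level (high_async) is one
possible reading of "until usage drops back below some threshold"; it
keeps the work item from being re-queued on every subsequent charge.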