From: Hui Zhu <hui.zhu@linux.dev>
To: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Shuah Khan <shuah@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Miguel Ojeda <ojeda@kernel.org>,
Nathan Chancellor <nathan@kernel.org>,
Kees Cook <kees@kernel.org>, Tejun Heo <tj@kernel.org>,
Jeff Xu <jeffxu@chromium.org>,
mkoutny@suse.com, Jan Hendrik Farr <kernel@jfarr.cc>,
Christian Brauner <brauner@kernel.org>,
Randy Dunlap <rdunlap@infradead.org>,
Brian Gerst <brgerst@gmail.com>,
Masahiro Yamada <masahiroy@kernel.org>,
davem@davemloft.net, Jakub Kicinski <kuba@kernel.org>,
Jesper Dangaard Brouer <hawk@kernel.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
cgroups@vger.kernel.org, bpf@vger.kernel.org,
linux-kselftest@vger.kernel.org
Cc: Hui Zhu <zhuhui@kylinos.cn>
Subject: [RFC PATCH v2 0/3] Memory Controller eBPF support
Date: Tue, 30 Dec 2025 11:01:58 +0800 [thread overview]
Message-ID: <cover.1767012332.git.zhuhui@kylinos.cn> (raw)
From: Hui Zhu <zhuhui@kylinos.cn>
This series adds BPF struct_ops support to the memory controller,
enabling dynamic control over memory pressure through the
memcg_nr_pages_over_high mechanism. This allows administrators to
suppress low-priority cgroups' memory usage based on custom
policies implemented in BPF programs.
Background and Motivation
The memory controller provides memory.high limits to throttle
cgroups exceeding their soft limit. However, the current
implementation applies the same policy across all cgroups
without considering priority or workload characteristics.
This series introduces a BPF hook that allows reporting
additional "pages over high" for specific cgroups, effectively
increasing memory pressure and throttling for lower-priority
workloads when higher-priority cgroups need resources.
Use Case: Priority-Based Memory Management
Consider a system running both latency-sensitive services and
batch processing workloads. When the high-priority service
experiences memory pressure (detected via page scan events),
the BPF program can artificially inflate the "over high" count
for low-priority cgroups, causing them to be throttled more
aggressively and freeing up memory for the critical workload.
Implementation
This series builds upon Roman Gushchin's BPF OOM patch series in [1].
The implementation adds:
1. A memcg_bpf_ops struct_ops type with memcg_nr_pages_over_high
hook
2. Integration into memory pressure calculation paths
3. Cgroup hierarchy management (inheritance during online/offline)
4. SRCU protection for safe concurrent access
Why Not PSI?
This implementation does not use PSI for triggering, as discussed
in [2].
Instead, the sample code monitors PGSCAN events via tracepoints,
which provides more direct feedback on memory pressure.
Example Results
Testing on x86_64 QEMU (10 CPU, 4GB RAM, cache=none swap):
root@ubuntu:~# cat /proc/sys/vm/swappiness
60
root@ubuntu:~# mkdir /sys/fs/cgroup/high
root@ubuntu:~# mkdir /sys/fs/cgroup/low
root@ubuntu:~# ./memcg /sys/fs/cgroup/low /sys/fs/cgroup/high 100 1024
Successfully attached!
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% \
--vm-method all --seed 2025 --metrics -t 60
[1] 1075
stress-ng: info: [1075] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1076] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1075] dispatching hogs: 4 vm
stress-ng: info: [1076] dispatching hogs: 4 vm
stress-ng: metrc: [1076] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1076] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1076] vm 21033377 60.47 158.04 3.66 347825.55 130076.67 66.85 834836
stress-ng: info: [1076] skipped: 0
stress-ng: info: [1076] passed: 4: vm (4)
stress-ng: info: [1076] failed: 0
stress-ng: info: [1076] metrics untrustworthy: 0
stress-ng: info: [1076] successful run completed in 1 min, 0.72 secs
root@ubuntu:~# stress-ng: metrc: [1075] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1075] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1075] vm 11568 65.05 0.00 0.21 177.83 56123.74 0.08 3200
stress-ng: info: [1075] skipped: 0
stress-ng: info: [1075] passed: 4: vm (4)
stress-ng: info: [1075] failed: 0
stress-ng: info: [1075] metrics untrustworthy: 0
stress-ng: info: [1075] successful run completed in 1 min, 5.06 secs
Results show the low-priority cgroup (/sys/fs/cgroup/low) was
significantly throttled:
- High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
- Low-priority cgroup: 11,568 bogo ops at 177 ops/s
The stress-ng process in the low-priority cgroup experienced a
~99.9% slowdown in memory operations compared to the
high-priority cgroup, demonstrating effective priority
enforcement through BPF-controlled memory pressure.
Patch Overview
PATCH 1/3: Core kernel implementation
- Adds memcg_bpf_ops struct_ops support
- Implements cgroup lifecycle management
- Integrates hook into pressure calculation
PATCH 2/3: Selftest suite
- Validates attach/detach behavior
- Tests hierarchy inheritance
- Verifies throttling effectiveness
PATCH 3/3: Sample programs
- Demonstrates PGSCAN-based triggering
- Shows priority-based throttling
- Provides reference implementation
Changelog:
v2:
According to the comments of Tejun Heo, rebased on Roman Gushchin's BPF
OOM patch series [1] and added hierarchical delegation support.
According to the comments of Roman Gushchin and Michal Hocko, Designed
concrete use case scenarios and provided test results.
[1] https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
[2] https://lore.kernel.org/lkml/1d9a162605a3f32ac215430131f7745488deaa34@linux.dev/
Hui Zhu (3):
mm: memcontrol: Add BPF struct_ops for memory pressure control
selftests/bpf: Add tests for memcg_bpf_ops
samples/bpf: Add memcg priority control example
MAINTAINERS | 5 +
include/linux/memcontrol.h | 2 +
mm/bpf_memcontrol.c | 241 ++++++++++++-
mm/bpf_memcontrol.h | 73 ++++
mm/memcontrol.c | 27 +-
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 9 +-
samples/bpf/memcg.bpf.c | 95 +++++
samples/bpf/memcg.c | 204 +++++++++++
.../selftests/bpf/prog_tests/memcg_ops.c | 340 ++++++++++++++++++
.../selftests/bpf/progs/memcg_ops_over_high.c | 95 +++++
11 files changed, 1082 insertions(+), 10 deletions(-)
create mode 100644 mm/bpf_memcontrol.h
create mode 100644 samples/bpf/memcg.bpf.c
create mode 100644 samples/bpf/memcg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops_over_high.c
--
2.43.0
next reply other threads:[~2025-12-30 3:02 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-12-30 3:01 Hui Zhu [this message]
2025-12-30 3:01 ` [RFC PATCH v2 1/3] mm: memcontrol: Add BPF struct_ops for memory pressure control Hui Zhu
2025-12-30 3:02 ` [RFC PATCH v2 2/3] selftests/bpf: Add tests for memcg_bpf_ops Hui Zhu
2025-12-30 3:02 ` [RFC PATCH v2 3/3] samples/bpf: Add memcg priority control example Hui Zhu
2025-12-30 9:49 ` [RFC PATCH v2 0/3] Memory Controller eBPF support Michal Koutný
2025-12-30 13:24 ` Chen Ridong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cover.1767012332.git.zhuhui@kylinos.cn \
--to=hui.zhu@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=brauner@kernel.org \
--cc=brgerst@gmail.com \
--cc=cgroups@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=davem@davemloft.net \
--cc=eddyz87@gmail.com \
--cc=hannes@cmpxchg.org \
--cc=haoluo@google.com \
--cc=hawk@kernel.org \
--cc=jeffxu@chromium.org \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kees@kernel.org \
--cc=kernel@jfarr.cc \
--cc=kpsingh@kernel.org \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=martin.lau@linux.dev \
--cc=masahiroy@kernel.org \
--cc=mhocko@kernel.org \
--cc=mkoutny@suse.com \
--cc=muchun.song@linux.dev \
--cc=nathan@kernel.org \
--cc=ojeda@kernel.org \
--cc=peterz@infradead.org \
--cc=rdunlap@infradead.org \
--cc=roman.gushchin@linux.dev \
--cc=sdf@fomichev.me \
--cc=shakeel.butt@linux.dev \
--cc=shuah@kernel.org \
--cc=song@kernel.org \
--cc=tj@kernel.org \
--cc=yonghong.song@linux.dev \
--cc=zhuhui@kylinos.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).