From: Vernon Yang <vernon2gm@gmail.com>
To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
roman.gushchin@linux.dev, inwardvessel@gmail.com,
shakeel.butt@linux.dev, ast@kernel.org, daniel@iogearbox.net,
surenb@google.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
bpf@vger.kernel.org, baohua@kernel.org, lance.yang@linux.dev,
dev.jain@arm.com, Vernon Yang <yanglincheng@kylinos.cn>
Subject: [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
Date: Mon, 4 May 2026 00:50:20 +0800
Message-ID: <20260503165024.1526680-1-vernon2gm@gmail.com>
From: Vernon Yang <yanglincheng@kylinos.cn>
Hi all,
Background
==========
As is well known, a single system can run many different workloads
simultaneously. However, THP is not beneficial in every scenario: it is
best suited to memory-intensive applications that are not sensitive to
tail latency. Redis, for example, is sensitive to tail latency and is
therefore a poor fit for THP. In practice, however, THP is often
disabled system-wide because of Redis, preventing all other workloads
from benefiting from it.
There are also embedded scenarios (e.g. Android) that directly use 2MB
THP, where the granularity is too large. For this reason, mTHP, which
supports multiple THP sizes, was introduced in v6.8. In practice,
however, a single mTHP size is still fixed globally, and there is no
way to automatically select different mTHP sizes for different
scenarios.
Testing showed the following:
- When the system has plenty of free memory, Redis runs fine with mTHP.
Performance degradation in Redis only occurs when the system is under
high memory pressure.
- When a large number of small-memory processes use mTHP, memory is
easily wasted, and performance may also degrade during rapid memory
allocation/release.
Previously, "Cgroup-based THP control"[1] was proposed, but it had the
following issues.
- It breaks the cgroup hierarchy property.
- Add new THP knobs, making sysadmin's job more complex
Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
following issues.
- It didn't address the issue on the per-process mode.
- For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
the same objective, there is no need to add two mechanisms for the
same purpose.
- Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
implementation.
- The test cases are too simplistic, lacking eBPF cases similar to real
workloads such as sched_ext.
If I have missed anything, please let me know. Thanks!
Solution
========
This series solves all the problems mentioned above:
1. Use cgroup-bpf to customize the mTHP size for different scenarios.
2. Use one cgroup eBPF program to monitor all sub-cgroups. Sub-cgroups
under the same parent cgroup share the same eBPF program; only sibling
cgroups whose parent has no attached eBPF program may attach different
eBPF programs. This preserves the hierarchy property of the cgroup.
3. Automatically select different mTHP sizes for different cgroups;
let's focus on making them truly transparent.
4. Design the mthp_ext sample to address real workload issues.
The main functions of mthp_ext are as follows (see the sketch after
this list):
- When a sub-cgroup is under high memory pressure (by default, a PSI
"full" stall of 100ms within a 1s window), it automatically falls back
to 4KB.
- When the anon+shmem memory usage of a sub-cgroup falls below the
minimum (default 16MB), small-memory processes automatically fall back
to 4KB.
- Under normal conditions, i.e. no memory pressure and anon+shmem usage
above the minimum, the kernel may use all mTHP sizes.
- The root cgroup (/sys/fs/cgroup) is monitored by default, with
support for specifying any cgroup directory.
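To make the policy concrete, here is a minimal sketch of what such a
program could look like. The bpf_mthp_ops struct_ops and the
bpf_cgroup_flush_stats()/bpf_cgroup_stall() kfuncs are named by this
series, but the callback name, all signatures, and the thresholds below
are illustrative assumptions, not the exact interface; the real program
is samples/bpf/mthp_ext.bpf.c in patch 4.

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only: bpf_mthp_ops and the two kfuncs come from
 * this series, but every signature, field, and threshold below is an
 * assumption.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define STALL_THRESHOLD_NS	(100ULL * 1000 * 1000)	/* 100ms "full" stall */
#define MIN_ANON_SHMEM		(16ULL << 20)		/* 16MB floor */

/* Hypothetical kfunc prototypes; the actual ones are in patch 2. */
extern void bpf_cgroup_flush_stats(struct cgroup *cgrp) __ksym;
extern u64 bpf_cgroup_stall(struct cgroup *cgrp) __ksym;

/* Hypothetical callback: given the orders the kernel would like to
 * use, return the subset this cgroup may actually use (0 == 4KB only). */
SEC("struct_ops/suggested_orders")
unsigned long BPF_PROG(suggested_orders, struct cgroup *cgrp,
		       unsigned long orders)
{
	/* Make sure the statistics read below are fresh. */
	bpf_cgroup_flush_stats(cgrp);

	/* High memory pressure: fall back to 4KB. */
	if (bpf_cgroup_stall(cgrp) >= STALL_THRESHOLD_NS)
		return 0;

	/* Small-memory cgroups also fall back to 4KB. Reading the
	 * anon+shmem usage is elided here; the accessor is
	 * series-specific. */

	/* Otherwise, let the kernel use every enabled mTHP size. */
	return orders;
}

SEC(".struct_ops.link")
struct bpf_mthp_ops mthp_policy = {
	.suggested_orders = (void *)suggested_orders,
};

char LICENSE[] SEC("license") = "GPL";

The userspace loader (samples/bpf/mthp_ext.c) then attaches the policy,
monitoring /sys/fs/cgroup by default or any cgroup directory given on
the command line.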
Performance
===========
Below are some performance test results, collected on an x86_64 machine
(AMD Ryzen 9 9950X, 16C/32T, 32G memory, 8G zram).
NOTE: The always/never labels below mean setting all mTHP sizes to
always/never. See [4] for the detailed test scripts.
redis results
~~~~~~~~~~~~~
command: redis-benchmark --csv -r 3000000 -n 3000000 -d 1024 -c 16 -P 32 -t set
When cgroup memory.high=max.
| redis-noBGSAVE | always | never | always+mthp_ext |
|----------------|-------------|----------------------|----------------------|
| rps | 1410824.167 | 1210387.500 (-14.2%) | 1265659.833 (-10.3%) |
| avg_latency_ms | 0.220 | 0.259 (-17.7%) | 0.247 (-12.3%) |
| p95_latency_ms | 0.618 | 0.708 (-14.6%) | 0.676 (-9.40%) |
| p99_latency_ms | 0.687 | 0.818 (-19.1%) | 0.756 (-10.0%) |
| redis-BGSAVE | always | never | always+mthp_ext |
|----------------|-------------|----------------------|----------------------|
| rps | 1418032.127 | 1212306.873 (-14.5%) | 1261069.373 (-11.1%) |
| avg_latency_ms | 0.218 | 0.259 (-18.8%) | 0.248 (-13.8%) |
| p95_latency_ms | 0.620 | 0.714 (-15.2%) | 0.687 (-10.8%) |
| p99_latency_ms | 0.684 | 0.828 (-21.1%) | 0.756 (-10.5%) |
When cgroup memory.high=2G.
| redis-noBGSAVE | always | never | always+mthp_ext |
|----------------|-----------|-----------------------|-----------------------|
| rps | 24813.980 | 1049254.583 (4128.5%) | 1063171.270 (4184.6%) |
| avg_latency_ms | 13.317 | 0.302 ( 97.7%) | 0.298 ( 97.8%) |
| p95_latency_ms | 23.220 | 0.754 ( 96.8%) | 0.828 ( 96.4%) |
| p99_latency_ms | 369.492 | 1.154 ( 99.7%) | 1.615 ( 99.6%) |
| redis-BGSAVE | always | never | always+mthp_ext |
|----------------|-----------|-----------------------|-----------------------|
| rps | 48373.433 | 1058403.500 (2088.0%) | 1070805.707 (2113.6%) |
| avg_latency_ms | 6.884 | 0.300 ( 95.6%) | 0.296 ( 95.7%) |
| p95_latency_ms | 16.474 | 0.743 ( 95.5%) | 0.820 ( 95.0%) |
| p99_latency_ms | 326.058 | 1.170 ( 99.6%) | 1.586 ( 99.5%) |
When Redis is under no memory pressure, RPS drops by 10.3% (from 1.4M
to 1.2M; is this within the acceptable range?). Under high memory
pressure, however, RPS improves by 4184.6% (from 24K to 1M) while tail
latency drops by 99%.
unixbench results
~~~~~~~~~~~~~~~~~
command: ./Run -c 1 shell8
| unixbench shell8 | always | never | always+mthp_ext |
|------------------|---------|-----------------|-----------------|
| Score | 23019.4 | 24378.3 (5.90%) | 24314.5 (5.63%) |
mthp_ext improved by 5.63%.
kernbench results
~~~~~~~~~~~~~~~~~
When cgroup memory.high=max, mthp_ext shows no regression.
always never always+mthp_ext
Amean user-32 19666.44 ( 0.00%) 18464.56 * 6.11%* 19650.13 * 0.08%*
Amean syst-32 1169.16 ( 0.00%) 2235.17 * -91.18%* 1169.42 ( -0.02%)
Amean elsp-32 702.51 ( 0.00%) 699.90 * 0.37%* 702.15 ( 0.05%)
BAmean-95 user-32 19665.93 ( 0.00%) 18461.86 ( 6.12%) 19647.61 ( 0.09%)
BAmean-95 syst-32 1168.68 ( 0.00%) 2234.27 ( -91.18%) 1169.20 ( -0.04%)
BAmean-95 elsp-32 702.34 ( 0.00%) 699.80 ( 0.36%) 702.04 ( 0.04%)
BAmean-99 user-32 19665.93 ( 0.00%) 18461.86 ( 6.12%) 19647.61 ( 0.09%)
BAmean-99 syst-32 1168.68 ( 0.00%) 2234.27 ( -91.18%) 1169.20 ( -0.04%)
BAmean-99 elsp-32 702.34 ( 0.00%) 699.80 ( 0.36%) 702.04 ( 0.04%)
When cgroup memory.high=2G, mthp_ext improves system time by 20.98%.
always never always+mthp_ext
Amean user-32 20459.89 ( 0.00%) 18517.24 * 9.49%* 19963.73 * 2.43%*
Amean syst-32 11890.63 ( 0.00%) 6681.95 * 43.80%* 9395.94 * 20.98%*
Amean elsp-32 1305.29 ( 0.00%) 928.13 * 28.89%* 1109.37 * 15.01%*
BAmean-95 user-32 20439.38 ( 0.00%) 18510.65 ( 9.44%) 19957.89 ( 2.36%)
BAmean-95 syst-32 11789.99 ( 0.00%) 6679.03 ( 43.35%) 9381.77 ( 20.43%)
BAmean-95 elsp-32 1302.18 ( 0.00%) 927.89 ( 28.74%) 1108.65 ( 14.86%)
BAmean-99 user-32 20439.38 ( 0.00%) 18510.65 ( 9.44%) 19957.89 ( 2.36%)
BAmean-99 syst-32 11789.99 ( 0.00%) 6679.03 ( 43.35%) 9381.77 ( 20.43%)
BAmean-99 elsp-32 1302.18 ( 0.00%) 927.89 ( 28.74%) 1108.65 ( 14.86%)
TODO
====
- Preserve the cgroup hierarchy property: if an eBPF program already
exists in the sub-cgroup, trigger an error and clear the already-set
bpf_mthp_ops data.
- Make mthp_ext handle the different "enum tva_type" values (see the
sketch after this list). For example, for small-memory processes, only
4KB is used for TVA_PAGEFAULT, while TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE
continue to collapse all mTHP sizes. Under high memory pressure, only
4KB is used for TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE
continues to collapse all mTHP sizes.
- Add selftests.
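As a rough illustration of the intended tva_type handling (a sketch
only; the function shape, parameters, and names below are assumptions,
not this series' interface):

/* Hedged sketch of the tva_type TODO item above. */
static unsigned long filter_orders(enum tva_type type, unsigned long orders,
				   bool high_pressure, bool small_memory)
{
	/* Forced collapse always proceeds with every mTHP size. */
	if (type == TVA_FORCED_COLLAPSE)
		return orders;

	/* High pressure: page faults and khugepaged use 4KB only. */
	if (high_pressure)
		return 0;

	/* Small-memory processes fault in 4KB, but khugepaged may still
	 * collapse them to larger sizes later. */
	if (small_memory && type == TVA_PAGEFAULT)
		return 0;

	return orders;
}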
If there are additional scenarios, please let me know as well, so I can
run further prototype verification tests to make mTHP more transparent.
If any of the above strategies can be integrated into the kernel,
please let me know; I would be delighted to incorporate them.
This series is based on linux v7.1-rc1 (26fd6bff2c05) plus the first
four patches of "mm: BPF OOM"[3].
Thank you very much for your comments and discussions.
[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com
[2] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com
[3] https://lore.kernel.org/linux-mm/20260127024421.494929-1-roman.gushchin@linux.dev
[4] https://github.com/vernon2gh/app_and_module/tree/main/mthp_ext
Vernon Yang (4):
psi: add psi_group_flush_stats() function
bpf: add bpf_cgroup_{flush_stats,stall} function
mm: introduce bpf_mthp_ops struct ops
samples: bpf: add mthp_ext
MAINTAINERS | 3 +
include/linux/bpf_huge_memory.h | 35 ++++
include/linux/cgroup-defs.h | 1 +
include/linux/huge_mm.h | 6 +
include/linux/psi.h | 1 +
kernel/bpf/helpers.c | 29 +++
kernel/sched/psi.c | 34 +++-
mm/Kconfig | 14 ++
mm/Makefile | 1 +
mm/bpf_huge_memory.c | 169 ++++++++++++++++
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 7 +-
samples/bpf/mthp_ext.bpf.c | 142 +++++++++++++
samples/bpf/mthp_ext.c | 340 ++++++++++++++++++++++++++++++++
samples/bpf/mthp_ext.h | 30 +++
15 files changed, 804 insertions(+), 9 deletions(-)
create mode 100644 include/linux/bpf_huge_memory.h
create mode 100644 mm/bpf_huge_memory.c
create mode 100644 samples/bpf/mthp_ext.bpf.c
create mode 100644 samples/bpf/mthp_ext.c
create mode 100644 samples/bpf/mthp_ext.h
--
2.53.0