From: Vernon Yang <vernon2gm@gmail.com>
To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
roman.gushchin@linux.dev, inwardvessel@gmail.com,
shakeel.butt@linux.dev, ast@kernel.org, daniel@iogearbox.net,
surenb@google.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
bpf@vger.kernel.org, baohua@kernel.org, lance.yang@linux.dev,
dev.jain@arm.com, Vernon Yang <yanglincheng@kylinos.cn>
Subject: [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
Date: Mon, 4 May 2026 00:50:20 +0800
Message-ID: <20260503165024.1526680-1-vernon2gm@gmail.com>
From: Vernon Yang <yanglincheng@kylinos.cn>
Hi all,
Background
==========
As is well known, a single system can run many different workloads
simultaneously. However, THP is not beneficial in every scenario: it is
best suited to memory-intensive applications that are not sensitive to
tail latency. Redis, for example, is sensitive to tail latency and is
therefore a poor fit for THP. In practice, however, THP is often
disabled system-wide because of Redis, preventing all other workloads
from benefiting from it.
There are also embedded scenarios (e.g. Android) that directly use 2MB
THP, where the granularity is too large. For this reason, mTHP, which
supports multiple THP sizes, was introduced in v6.8. In practice,
however, a single mTHP size is still fixed globally, and there is no
way to automatically select different mTHP sizes for different
scenarios.
Testing showed the following:
- When the system has plenty of free memory, Redis runs fine with mTHP.
Performance degradation in Redis only occurs when the system is under
high memory pressure.
- When a large number of small-memory processes use mTHP, memory is
easily wasted, and performance may also degrade during rapid memory
allocation/release.
Previously, "Cgroup-based THP control"[1] was proposed, but it had the
following issues.
- It breaks the cgroup hierarchy property.
- Add new THP knobs, making sysadmin's job more complex
Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
following issues.
- It didn't address the issue on the per-process mode.
- For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
the same objective, there is no need to add two mechanisms for the
same purpose.
- Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
implementation.
- The test cases are too simplistic, lacking eBPF cases similar to real
workloads such as sched_ext.
If I have missed anything, please let me know. Thanks!
Solution
========
This series solves all the problems mentioned above:
1. Use cgroup-bpf to customize the mTHP size for different scenarios.
2. Use one cgroup eBPF program to monitor all sub-cgroups. Sub-cgroups
under the same parent cgroup share the same eBPF program; only sibling
cgroups whose parent has no attached eBPF program may attach different
eBPF programs. This preserves the hierarchy property of the cgroup.
3. Automatically select different mTHP sizes for different cgroups;
let's focus on making them truly transparent.
4. Design the mthp_ext sample to address real workload issues.
The main functions of mthp_ext are as follows (see the sketch after
this list):
- When a sub-cgroup is under high memory pressure (by default, a PSI
"full" stall of 100ms within a 1s window), it automatically falls back
to 4KB.
- When the anon+shmem memory usage of a sub-cgroup falls below the
minimum (default 16MB), small-memory processes automatically fall back
to 4KB.
- Under normal conditions, i.e. no memory pressure and anon+shmem usage
above the minimum, the kernel may use all mTHP sizes.
- The root cgroup (/sys/fs/cgroup) is monitored by default, with
support for specifying any cgroup directory.
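To make the policy concrete, here is a minimal sketch of what such a
program could look like. The bpf_mthp_ops struct_ops and the
bpf_cgroup_flush_stats()/bpf_cgroup_stall() kfuncs are named by this
series, but the callback name, all signatures, and the thresholds below
are illustrative assumptions, not the exact interface; the real program
is samples/bpf/mthp_ext.bpf.c in patch 4.

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only: bpf_mthp_ops and the two kfuncs come from
 * this series, but every signature, field, and threshold below is an
 * assumption.
 */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define STALL_THRESHOLD_NS	(100ULL * 1000 * 1000)	/* 100ms "full" stall */
#define MIN_ANON_SHMEM		(16ULL << 20)		/* 16MB floor */

/* Hypothetical kfunc prototypes; the actual ones are in patch 2. */
extern void bpf_cgroup_flush_stats(struct cgroup *cgrp) __ksym;
extern u64 bpf_cgroup_stall(struct cgroup *cgrp) __ksym;

/* Hypothetical callback: given the orders the kernel would like to
 * use, return the subset this cgroup may actually use (0 == 4KB only). */
SEC("struct_ops/suggested_orders")
unsigned long BPF_PROG(suggested_orders, struct cgroup *cgrp,
		       unsigned long orders)
{
	/* Make sure the statistics read below are fresh. */
	bpf_cgroup_flush_stats(cgrp);

	/* High memory pressure: fall back to 4KB. */
	if (bpf_cgroup_stall(cgrp) >= STALL_THRESHOLD_NS)
		return 0;

	/* Small-memory cgroups also fall back to 4KB. Reading the
	 * anon+shmem usage is elided here; the accessor is
	 * series-specific. */

	/* Otherwise, let the kernel use every enabled mTHP size. */
	return orders;
}

SEC(".struct_ops.link")
struct bpf_mthp_ops mthp_policy = {
	.suggested_orders = (void *)suggested_orders,
};

char LICENSE[] SEC("license") = "GPL";

The userspace loader (samples/bpf/mthp_ext.c) then attaches the policy,
monitoring /sys/fs/cgroup by default or any cgroup directory given on
the command line.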
Performance
===========
Below are some performance test results, collected on an x86_64 machine
(AMD Ryzen 9 9950X, 16C/32T, 32G memory, 8G zram).
NOTE: The always/never labels below mean setting all mTHP sizes to
always/never. See [4] for the detailed test scripts.
redis results
~~~~~~~~~~~~~
command: redis-benchmark --csv -r 3000000 -n 3000000 -d 1024 -c 16 -P 32 -t set
When cgroup memory.high=max.
| redis-noBGSAVE | always | never | always+mthp_ext |
|----------------|-------------|----------------------|----------------------|
| rps | 1410824.167 | 1210387.500 (-14.2%) | 1265659.833 (-10.3%) |
| avg_latency_ms | 0.220 | 0.259 (-17.7%) | 0.247 (-12.3%) |
| p95_latency_ms | 0.618 | 0.708 (-14.6%) | 0.676 (-9.40%) |
| p99_latency_ms | 0.687 | 0.818 (-19.1%) | 0.756 (-10.0%) |
| redis-BGSAVE | always | never | always+mthp_ext |
|----------------|-------------|----------------------|----------------------|
| rps | 1418032.127 | 1212306.873 (-14.5%) | 1261069.373 (-11.1%) |
| avg_latency_ms | 0.218 | 0.259 (-18.8%) | 0.248 (-13.8%) |
| p95_latency_ms | 0.620 | 0.714 (-15.2%) | 0.687 (-10.8%) |
| p99_latency_ms | 0.684 | 0.828 (-21.1%) | 0.756 (-10.5%) |
When cgroup memory.high=2G.
| redis-noBGSAVE | always | never | always+mthp_ext |
|----------------|-----------|-----------------------|-----------------------|
| rps | 24813.980 | 1049254.583 (4128.5%) | 1063171.270 (4184.6%) |
| avg_latency_ms | 13.317 | 0.302 ( 97.7%) | 0.298 ( 97.8%) |
| p95_latency_ms | 23.220 | 0.754 ( 96.8%) | 0.828 ( 96.4%) |
| p99_latency_ms | 369.492 | 1.154 ( 99.7%) | 1.615 ( 99.6%) |
| redis-BGSAVE | always | never | always+mthp_ext |
|----------------|-----------|-----------------------|-----------------------|
| rps | 48373.433 | 1058403.500 (2088.0%) | 1070805.707 (2113.6%) |
| avg_latency_ms | 6.884 | 0.300 ( 95.6%) | 0.296 ( 95.7%) |
| p95_latency_ms | 16.474 | 0.743 ( 95.5%) | 0.820 ( 95.0%) |
| p99_latency_ms | 326.058 | 1.170 ( 99.6%) | 1.586 ( 99.5%) |
When Redis is under no memory pressure, RPS drops by 10.3% (from 1.4M
to 1.2M; is this within the acceptable range?). Under high memory
pressure, however, RPS improves by 4184.6% (from 24K to 1M) while tail
latency drops by 99%.
unixbench results
~~~~~~~~~~~~~~~~~
command: ./Run -c 1 shell8
| unixbench shell8 | always | never | always+mthp_ext |
|------------------|---------|-----------------|-----------------|
| Score | 23019.4 | 24378.3 (5.90%) | 24314.5 (5.63%) |
mthp_ext improved by 5.63%.
kernbench results
~~~~~~~~~~~~~~~~~
When cgroup memory.high=max, mthp_ext shows no regression.
always never always+mthp_ext
Amean user-32 19666.44 ( 0.00%) 18464.56 * 6.11%* 19650.13 * 0.08%*
Amean syst-32 1169.16 ( 0.00%) 2235.17 * -91.18%* 1169.42 ( -0.02%)
Amean elsp-32 702.51 ( 0.00%) 699.90 * 0.37%* 702.15 ( 0.05%)
BAmean-95 user-32 19665.93 ( 0.00%) 18461.86 ( 6.12%) 19647.61 ( 0.09%)
BAmean-95 syst-32 1168.68 ( 0.00%) 2234.27 ( -91.18%) 1169.20 ( -0.04%)
BAmean-95 elsp-32 702.34 ( 0.00%) 699.80 ( 0.36%) 702.04 ( 0.04%)
BAmean-99 user-32 19665.93 ( 0.00%) 18461.86 ( 6.12%) 19647.61 ( 0.09%)
BAmean-99 syst-32 1168.68 ( 0.00%) 2234.27 ( -91.18%) 1169.20 ( -0.04%)
BAmean-99 elsp-32 702.34 ( 0.00%) 699.80 ( 0.36%) 702.04 ( 0.04%)
When cgroup memory.high=2G, mthp_ext improves system time by 20.98%.
always never always+mthp_ext
Amean user-32 20459.89 ( 0.00%) 18517.24 * 9.49%* 19963.73 * 2.43%*
Amean syst-32 11890.63 ( 0.00%) 6681.95 * 43.80%* 9395.94 * 20.98%*
Amean elsp-32 1305.29 ( 0.00%) 928.13 * 28.89%* 1109.37 * 15.01%*
BAmean-95 user-32 20439.38 ( 0.00%) 18510.65 ( 9.44%) 19957.89 ( 2.36%)
BAmean-95 syst-32 11789.99 ( 0.00%) 6679.03 ( 43.35%) 9381.77 ( 20.43%)
BAmean-95 elsp-32 1302.18 ( 0.00%) 927.89 ( 28.74%) 1108.65 ( 14.86%)
BAmean-99 user-32 20439.38 ( 0.00%) 18510.65 ( 9.44%) 19957.89 ( 2.36%)
BAmean-99 syst-32 11789.99 ( 0.00%) 6679.03 ( 43.35%) 9381.77 ( 20.43%)
BAmean-99 elsp-32 1302.18 ( 0.00%) 927.89 ( 28.74%) 1108.65 ( 14.86%)
TODO
====
- Preserve the cgroup hierarchy property: if an eBPF program already
exists in the sub-cgroup, trigger an error and clear the already-set
bpf_mthp_ops data.
- Make mthp_ext handle the different "enum tva_type" values (see the
sketch after this list). For example, for small-memory processes, only
4KB is used for TVA_PAGEFAULT, while TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE
continue to collapse all mTHP sizes. Under high memory pressure, only
4KB is used for TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE
continues to collapse all mTHP sizes.
- Add selftests.
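As a rough illustration of the intended tva_type handling (a sketch
only; the function shape, parameters, and names below are assumptions,
not this series' interface):

/* Hedged sketch of the tva_type TODO item above. */
static unsigned long filter_orders(enum tva_type type, unsigned long orders,
				   bool high_pressure, bool small_memory)
{
	/* Forced collapse always proceeds with every mTHP size. */
	if (type == TVA_FORCED_COLLAPSE)
		return orders;

	/* High pressure: page faults and khugepaged use 4KB only. */
	if (high_pressure)
		return 0;

	/* Small-memory processes fault in 4KB, but khugepaged may still
	 * collapse them to larger sizes later. */
	if (small_memory && type == TVA_PAGEFAULT)
		return 0;

	return orders;
}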
If there are additional scenarios, please let me know as well, so I can
run further prototype verification tests to make mTHP more transparent.
If any of the above strategies can be integrated into the kernel,
please let me know; I would be delighted to incorporate them.
This series is based on linux v7.1-rc1 (26fd6bff2c05) plus the first
four patches of "mm: BPF OOM"[3].
Thank you very much for your comments and discussions.
[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com
[2] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com
[3] https://lore.kernel.org/linux-mm/20260127024421.494929-1-roman.gushchin@linux.dev
[4] https://github.com/vernon2gh/app_and_module/tree/main/mthp_ext
Vernon Yang (4):
psi: add psi_group_flush_stats() function
bpf: add bpf_cgroup_{flush_stats,stall} function
mm: introduce bpf_mthp_ops struct ops
samples: bpf: add mthp_ext
MAINTAINERS | 3 +
include/linux/bpf_huge_memory.h | 35 ++++
include/linux/cgroup-defs.h | 1 +
include/linux/huge_mm.h | 6 +
include/linux/psi.h | 1 +
kernel/bpf/helpers.c | 29 +++
kernel/sched/psi.c | 34 +++-
mm/Kconfig | 14 ++
mm/Makefile | 1 +
mm/bpf_huge_memory.c | 169 ++++++++++++++++
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 7 +-
samples/bpf/mthp_ext.bpf.c | 142 +++++++++++++
samples/bpf/mthp_ext.c | 340 ++++++++++++++++++++++++++++++++
samples/bpf/mthp_ext.h | 30 +++
15 files changed, 804 insertions(+), 9 deletions(-)
create mode 100644 include/linux/bpf_huge_memory.h
create mode 100644 mm/bpf_huge_memory.c
create mode 100644 samples/bpf/mthp_ext.bpf.c
create mode 100644 samples/bpf/mthp_ext.c
create mode 100644 samples/bpf/mthp_ext.h
--
2.53.0