[PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
@ 2026-05-08 15:00 Vernon Yang
  2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
  To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Hi all,

Background
==========

As is well known, a system can simultaneously run multiple different
scenarios. However, THP is not beneficial in every scenario — it is only
most suitable for memory-intensive applications that are not sensitive
to tail latency. For example, Redis, which is sensitive to tail latency,
is not suitable for THP. But in practice, due to Redis issues, the
entire THP functionality is often turned off, preventing other scenarios
from benefiting from it.

There are also some embedded scenarios (e.g. Android) that directly use
2MB THP, where the granularity is too large. Therefore, we introduced
mTHP in v6.8, which supports multiple-size THP. In practice, however, we
still globally fix a single mTHP size and are unable to automatically
select different mTHP sizes based on different scenarios.

After testing, it was found that

- When the system has a lot of free memory, it is normal for Redis to
  use mTHP. performance degradation in Redis only occurs when the system
  is under high memory pressure.
- Additionally, when a large number of small-memory processes use mTHP,
  memory waste is prone to occur, and performance degradation may also
  happen during fast memory allocation/release.

Previously, "Cgroup-based THP control"[1] was proposed, but it had the
following issues.

- It breaks the cgroup hierarchy property.
- Add new THP knobs, making sysadmin's job more complex

Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
following issues.

- It didn't address the issue on the per-process mode.
- For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
  the same objective, there is no need to add two mechanisms for the
  same purpose.
- Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
  faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
  cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
  implementation.
- Unclear ABI stability guarantees.
- The test cases are too simplistic, lacking eBPF cases similar to real
  workloads such as sched_ext.

If I miss some thing, please let me know. Thanks!

Solution
========

This series will solve all the problems mentioned above.

1. Using cgroup-bpf to customize mTHP size for different scenarios
2. Use a cgroup eBPF program to monitor all sub-cgroups. Sub-cgroups
   under the same parent-cgroup adopt the same eBPF program. Only multiple
   sibling-cgroups (where the parent-cgroup has no attached eBPF program)
   are supported to attach multiple different eBPF programs without
   breaking the hierarchy property of the cgroup.
3. Automatically select different mTHP sizes for different cgroups,
   let's focus on making them truly transparent.
4. Design mthp_ext case to address real workload issues and further
   clear/stabilize the ABI.

The main functions of the mthp_ext are as follows:

- When sub-cgroup is under high memory pressure (default, full 100ms 1s),
  it will automatically fallback to using 4KB.
- When the anon+shmem memory usage of sub-cgroup falls below the minimum
  memory (default 16MB), small-memory processes will automatically
  fallback to using 4KB.
- Under normal conditions, when there is no memory pressure and the
  anon+shmem memory usage exceeds the minimum memory, all mTHP sizes
  shall be utilized by kernel.
- Monitor the root-cgroup (/sys/fs/cgroup) directory by default, with
  support for specifying any cgroup directory.

Performance
===========

The below is some performance test results, testing on x86_64 machine
(AMD Ryzen9 9950X 16C32T, 32G memory, 8G zram).

NOTE: The following always/never labels indicate setting all mTHP sizes
to always/never. Detailed test script reference[4].

redis results
~~~~~~~~~~~~~

command: redis-benchmark --csv -r 3000000 -n 3000000 -d 1024 -c 16 -P 32 -t set

When cgroup memory.high=max, no memory pressure, seems only noise level
changes, mthp_ext no regression.

| redis-noBGSAVE | always      | never                | always+mthp_ext     |
|----------------|-------------|----------------------|---------------------|
| rps            | 1431307.083 | 1224004.250 (-14.5%) | 1420053.873 (-0.8%) |
| avg_latency_ms | 0.216       | 0.256       (-18.5%) | 0.218       (-0.9%) |
| p95_latency_ms | 0.612       | 0.708       (-15.7%) | 0.615       (-0.5%) |
| p99_latency_ms | 0.682       | 0.812       (-19.1%) | 0.692       (-1.5%) |

| redis-BGSAVE   | always      | never                | always+mthp_ext    |
|----------------|-------------|----------------------|--------------------|
| rps            | 1429093.707 | 1231569.587 (-13.8%) | 1431075.330 (0.1%) |
| avg_latency_ms | 0.216       | 0.255       (-18.1%) | 0.216       (0.0%) |
| p95_latency_ms | 0.618       | 0.706       (-14.2%) | 0.615       (0.5%) |
| p99_latency_ms | 0.684       | 0.823       (-20.3%) | 0.684       (0.0%) |

When cgroup memory.high=2G, high memory pressure, mthp_ext RPS improve by
3450%, while significantly reducing the tail latency by 99%.

| redis-noBGSAVE | always    | never                | always+mthp_ext      |
|----------------|-----------|----------------------|----------------------|
| rps            | 24932.790 | 976610.893 (3817.0%) | 885337.250 (3450.9%) |
| avg_latency_ms | 13.173    | 0.326        (97.5%) | 0.367        (97.2%) |
| p95_latency_ms | 23.028    | 0.786        (96.6%) | 1.511        (93.4%) |
| p99_latency_ms | 366.762   | 1.183        (99.7%) | 2.975        (99.2%) |

| redis-BGSAVE   | always    | never                 | always+mthp_ext      |
|----------------|-----------|-----------------------|----------------------|
| rps            | 50551.567 | 1026720.293 (1931.0%) | 892643.707 (1665.8%) |
| avg_latency_ms | 6.581     | 0.310         (95.3%) | 0.365        (94.5%) |
| p95_latency_ms | 16.730    | 0.772         (95.4%) | 1.447        (91.4%) |
| p99_latency_ms | 311.551   | 1.140         (99.6%) | 2.988        (99.0%) |

unixbench results
~~~~~~~~~~~~~~~~~

command: ./Run -c 1 shell8

mthp_ext improved by 5.99%.

| unixbench shell8 | always  | never           | always+mthp_ext |
|------------------|---------|-----------------|-----------------|
| Score            | 22916.8 | 24304.0 (6.05%) | 24289.9 (5.99%) |

kernbench results
~~~~~~~~~~~~~~~~~

When cgroup memory.high=max, no memory pressure, seems only noise level
changes, mthp_ext no regression.

                            always                 never               always+mthp_ext
Amean     user-32    19702.39 (   0.00%)    18428.90 *   6.46%*    19706.73 (  -0.02%)
Amean     syst-32     1159.55 (   0.00%)     2252.43 * -94.25%*     1177.48 *  -1.55%*
Amean     elsp-32      703.28 (   0.00%)      699.10 *   0.59%*      703.99 *  -0.10%*
BAmean-95 user-32    19701.79 (   0.00%)    18425.01 (   6.48%)    19704.78 (  -0.02%)
BAmean-95 syst-32     1159.43 (   0.00%)     2251.86 ( -94.22%)     1177.03 (  -1.52%)
BAmean-95 elsp-32      703.24 (   0.00%)      698.99 (   0.61%)      703.88 (  -0.09%)
BAmean-99 user-32    19701.79 (   0.00%)    18425.01 (   6.48%)    19704.78 (  -0.02%)
BAmean-99 syst-32     1159.43 (   0.00%)     2251.86 ( -94.22%)     1177.03 (  -1.52%)
BAmean-99 elsp-32      703.24 (   0.00%)      698.99 (   0.61%)      703.88 (  -0.09%)

When cgroup memory.high=2G, high memory pressure, mthp_ext improved by 26%.

                            always                 never               always+mthp_ext
Amean     user-32    20250.65 (   0.00%)    18368.91 *   9.29%*    18681.27 *   7.75%*
Amean     syst-32    12778.56 (   0.00%)     9636.99 *  24.58%*     9392.65 *  26.50%*
Amean     elsp-32     1377.55 (   0.00%)     1026.10 *  25.51%*     1019.40 *  26.00%*
BAmean-95 user-32    20233.75 (   0.00%)    18353.57 (   9.29%)    18678.01 (   7.69%)
BAmean-95 syst-32    12543.21 (   0.00%)     9612.28 (  23.37%)     9386.83 (  25.16%)
BAmean-95 elsp-32     1367.82 (   0.00%)     1023.75 (  25.15%)     1018.17 (  25.56%)
BAmean-99 user-32    20233.75 (   0.00%)    18353.57 (   9.29%)    18678.01 (   7.69%)
BAmean-99 syst-32    12543.21 (   0.00%)     9612.28 (  23.37%)     9386.83 (  25.16%)
BAmean-99 elsp-32     1367.82 (   0.00%)     1023.75 (  25.15%)     1018.17 (  25.56%)

TODO
====

- mthp_ext handles different "enum tva_type" values. For example, for
  small-memory processes, only 4KB is used in TVA_PAGEFAULT, while
  TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continues to collapse all mthp
  size. Under high memory pressure, only 4KB is used for
  TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues to
  collapse all mthp size.
- selftest

If there are additional scenarios, please let me know as well, so I can
conduct further prototype verification tests to make mTHP more
transparent and further clear/stabilize the BPF-THP ABI.

If any of the above the strategies can be integrated into the kernel,
please let me know. I would be delighted to incorporate these strategies
into the kernel.

This series is based on mm-new + "mm: BPF OOM"[3] first four patches.

Thank you very much for your comments and discussions.

[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com
[2] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com
[3] https://lore.kernel.org/linux-mm/20260127024421.494929-1-roman.gushchin@linux.dev
[4] https://github.com/vernon2gh/app_and_module/tree/main/mthp_ext

V1 -> V2:
- Rebase on mm-new, run all performance tests again.
- Register eBPF programs only when no mthp_ops exists in all sub-cgroup, do not
  destroy the cgroup hierarchy property.
- Fix newly created cgroups silently bypass the hierarchical BPF mTHP policy.
- Fix bpf_mthp_choose() UAF due to improper SRCU locking.
- Add bounds check in bpf_cgroup_stall() and fix return type to u64.
- Check cgroup_psi() return value.
- Fix spurious mTHP fallback during initial cgroup scan due to zero-init
  info->stall.
- Fix info->order being set to 0 when no processes are running in the cgroup.
- Fix Compilation fails when CONFIG_CGROUPS=y && CONFIG_PSI=n.
- Fix NULL pointer dereference of st_link.
- FIx infinite loop in trigger_scan() when read() returns an error.
- Fix integer overflow in FROM_MB() macro.
- Fix setup_psi_trigger() fail, but masks the error code.

V1 : https://lore.kernel.org/linux-mm/20260503165024.1526680-1-vernon2gm@gmail.com/

Vernon Yang (4):
  psi: add psi_group_flush_stats() function
  bpf: add bpf_cgroup_{flush_stats,stall} function
  mm: introduce bpf_mthp_ops struct ops
  samples: bpf: add mthp_ext

 MAINTAINERS                     |   3 +
 include/linux/bpf_huge_memory.h |  52 +++++
 include/linux/cgroup-defs.h     |   1 +
 include/linux/huge_mm.h         |   6 +
 include/linux/psi.h             |   5 +
 kernel/bpf/helpers.c            |  34 ++++
 kernel/cgroup/cgroup.c          |   2 +
 kernel/sched/psi.c              |  34 +++-
 mm/Kconfig                      |  14 ++
 mm/Makefile                     |   1 +
 mm/bpf_huge_memory.c            | 168 ++++++++++++++++
 samples/bpf/.gitignore          |   1 +
 samples/bpf/Makefile            |   7 +-
 samples/bpf/mthp_ext.bpf.c      | 148 ++++++++++++++
 samples/bpf/mthp_ext.c          | 339 ++++++++++++++++++++++++++++++++
 samples/bpf/mthp_ext.h          |  30 +++
 16 files changed, 836 insertions(+), 9 deletions(-)
 create mode 100644 include/linux/bpf_huge_memory.h
 create mode 100644 mm/bpf_huge_memory.c
 create mode 100644 samples/bpf/mthp_ext.bpf.c
 create mode 100644 samples/bpf/mthp_ext.c
 create mode 100644 samples/bpf/mthp_ext.h

--
2.53.0



^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/4] psi: add psi_group_flush_stats() function
  2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
@ 2026-05-08 15:00 ` Vernon Yang
  2026-05-08 15:19   ` Lorenzo Stoakes
  2026-05-08 15:00 ` [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
  To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Add psi_group_flush_stats() function to prepare for the subsequent
mthp_ext ebpf program.

no function changes.

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/linux/psi.h |  1 +
 kernel/sched/psi.c  | 34 ++++++++++++++++++++++++++--------
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index e0745873e3f2..7b4fd8190810 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -22,6 +22,7 @@ void psi_init(void);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
+void psi_group_flush_stats(struct psi_group *group);
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
 struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
 				       enum psi_res res, struct file *file,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d9c9d9480a45..76ffad90b0b5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1242,11 +1242,35 @@ void psi_cgroup_restart(struct psi_group *group)
 }
 #endif /* CONFIG_CGROUPS */
 
+/*
+ * __psi_group_flush_stats - flush the total stall time of a psi group
+ * @group: psi group to flush
+ */
+static void __psi_group_flush_stats(struct psi_group *group)
+{
+	u64 now;
+
+	/* Update averages before reporting them */
+	mutex_lock(&group->avgs_lock);
+	now = sched_clock();
+	collect_percpu_times(group, PSI_AVGS, NULL);
+	if (now >= group->avg_next_update)
+		group->avg_next_update = update_averages(group, now);
+	mutex_unlock(&group->avgs_lock);
+}
+
+void psi_group_flush_stats(struct psi_group *group)
+{
+	if (static_branch_likely(&psi_disabled))
+		return;
+
+	__psi_group_flush_stats(group);
+}
+
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	bool only_full = false;
 	int full;
-	u64 now;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
@@ -1256,13 +1280,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		return -EOPNOTSUPP;
 #endif
 
-	/* Update averages before reporting them */
-	mutex_lock(&group->avgs_lock);
-	now = sched_clock();
-	collect_percpu_times(group, PSI_AVGS, NULL);
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-	mutex_unlock(&group->avgs_lock);
+	__psi_group_flush_stats(group);
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	only_full = res == PSI_IRQ;
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function
  2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
  2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
@ 2026-05-08 15:00 ` Vernon Yang
  2026-05-08 15:40   ` bot+bpf-ci
  2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
  To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Add bpf_cgroup_{flush_stats,stall} function to prepare for the
subsequent mthp_ext ebpf program.

no function changes.

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 include/linux/psi.h  |  4 ++++
 kernel/bpf/helpers.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b4fd8190810..243dcf97bea4 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -52,6 +52,10 @@ static inline void psi_memstall_enter(unsigned long *flags) {}
 static inline void psi_memstall_leave(unsigned long *flags) {}
 
 #ifdef CONFIG_CGROUPS
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return NULL;
+}
 static inline int psi_cgroup_alloc(struct cgroup *cgrp)
 {
 	return 0;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 2bb60200c266..1c353e0ff14f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -29,6 +29,7 @@
 #include <linux/task_work.h>
 #include <linux/irq_work.h>
 #include <linux/buildid.h>
+#include <linux/psi.h>
 
 #include "../../lib/kstrtox.h"
 
@@ -2881,6 +2882,37 @@ bpf_task_get_cgroup1(struct task_struct *task, int hierarchy_id)
 		return NULL;
 	return cgrp;
 }
+
+/**
+ * bpf_cgroup_stall - acquire the total stall time of cgroup
+ * @cgrp: cgroup struct
+ * @states: psi states
+ *
+ * Return the total stall time.
+ */
+__bpf_kfunc u64 bpf_cgroup_stall(struct cgroup *cgrp, enum psi_states states)
+{
+	struct psi_group *group = cgroup_psi(cgrp);
+
+	if (unlikely(!group || (u32)states >= NR_PSI_STATES - 1))
+		return (u64)-1;
+
+	return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC);
+}
+
+/**
+ * bpf_cgroup_flush_stats - Flush cgroup's statistics
+ * @cgrp: cgroup struct
+ */
+__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp)
+{
+	struct psi_group *group = cgroup_psi(cgrp);
+
+	if (unlikely(!group))
+		return;
+
+	psi_group_flush_stats(group);
+}
 #endif /* CONFIG_CGROUPS */
 
 /**
@@ -4734,6 +4766,8 @@ BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
 BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_cgroup_stall)
+BTF_ID_FLAGS(func, bpf_cgroup_flush_stats, KF_SLEEPABLE)
 #endif
 BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_task_from_vpid, KF_ACQUIRE | KF_RET_NULL)
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
  2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
  2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
  2026-05-08 15:00 ` [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
@ 2026-05-08 15:00 ` Vernon Yang
  2026-05-08 15:40   ` bot+bpf-ci
                     ` (2 more replies)
  2026-05-08 15:00 ` [PATCH v2 4/4] samples: bpf: add mthp_ext Vernon Yang
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 16+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
  To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Introducing bpf_mthp_ops enables eBPF programs to register the
mthp_choose callback function via cgroup-ebpf.

Using cgroup-bpf to customize mTHP size for different scenarios，
automatically select different mTHP sizes for different cgroups,
let's focus on making them truly transparent.

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 MAINTAINERS                     |   3 +
 include/linux/bpf_huge_memory.h |  52 ++++++++++
 include/linux/cgroup-defs.h     |   1 +
 include/linux/huge_mm.h         |   6 ++
 kernel/cgroup/cgroup.c          |   2 +
 mm/Kconfig                      |  14 +++
 mm/Makefile                     |   1 +
 mm/bpf_huge_memory.c            | 168 ++++++++++++++++++++++++++++++++
 8 files changed, 247 insertions(+)
 create mode 100644 include/linux/bpf_huge_memory.h
 create mode 100644 mm/bpf_huge_memory.c

diff --git a/MAINTAINERS b/MAINTAINERS
index caaa0d6e6056..f1113eaa1193 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4887,7 +4887,10 @@ M:	Shakeel Butt <shakeel.butt@linux.dev>
 L:	bpf@vger.kernel.org
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	include/linux/bpf_huge_memory.h
+F:	mm/bpf_huge_memory.c
 F:	mm/bpf_memcontrol.c
+F:	samples/bpf/mthp_ext.*
 
 BPF [MISC]
 L:	bpf@vger.kernel.org
diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
new file mode 100644
index 000000000000..ffda445c9572
--- /dev/null
+++ b/include/linux/bpf_huge_memory.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_HUGE_MEMORY_H
+#define __BPF_HUGE_MEMORY_H
+
+#include <linux/cgroup-defs.h>
+
+/**
+ * struct bpf_mthp_ops - BPF callbacks for mTHP operations
+ * @mthp_choose: Choose the custom mTHP orders
+ *
+ * This structure defines the interface for BPF programs to customize
+ * mTHP behavior through struct_ops programs.
+ */
+struct bpf_mthp_ops {
+	unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders);
+};
+
+#ifdef CONFIG_BPF_TRANSPARENT_HUGEPAGE
+/**
+ * bpf_mthp_choose - Choose the custom mTHP orders using bpf
+ * @mm: task mm_struct
+ * @orders: original orders
+ *
+ * Return suited mTHP orders.
+ */
+unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders);
+
+/**
+ * cgroup_bpf_set_mthp_ops - Set sub-cgroup mthp_ops to parent cgroup
+ * @cgrp: want to set mthp_ops of sub-cgroup
+ * @parent: parent cgroup
+ */
+static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp,
+					   struct cgroup *parent)
+{
+	WRITE_ONCE(cgrp->mthp_ops, parent->mthp_ops);
+}
+#else
+static inline unsigned long bpf_mthp_choose(struct mm_struct *mm,
+					    unsigned long orders)
+{
+	return orders;
+}
+static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp,
+					   struct cgroup *parent)
+{
+}
+#endif /* CONFIG_BPF_TRANSPARENT_HUGEPAGE */
+
+#endif /* __BPF_HUGE_MEMORY_H */
+
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f42563739d2e..78854d0e06ab 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -628,6 +628,7 @@ struct cgroup {
 
 #ifdef CONFIG_BPF_SYSCALL
 	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
+	struct bpf_mthp_ops *mthp_ops;
 #endif
 #ifdef CONFIG_EXT_SUB_SCHED
 	struct scx_sched __rcu *scx_sched;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 127f9e1e7604..65da35fb0980 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
 #define _LINUX_HUGE_MM_H
 
 #include <linux/mm_types.h>
+#include <linux/bpf_huge_memory.h>
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 #include <linux/kobject.h>
@@ -296,6 +297,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       enum tva_type type,
 				       unsigned long orders)
 {
+	/* The eBPF-specified orders overrides which order is selected. */
+	orders &= bpf_mthp_choose(vma->vm_mm, orders);
+	if (!orders)
+		return 0;
+
 	/*
 	 * Optimization to check if required orders are enabled early. Only
 	 * forced collapse ignores sysfs configs.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 43adc96c7f1a..1dbef3e8b179 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5836,6 +5836,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
 	if (ret)
 		goto out_stat_exit;
 
+	cgroup_bpf_set_mthp_ops(cgrp, parent);
+
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
 		cgrp->ancestors[tcgrp->level] = tcgrp;
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 27dc5b0139ba..be49bde783a7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -949,6 +949,20 @@ config NO_PAGE_MAPCOUNT
 
 	  EXPERIMENTAL because the impact of some changes is still unclear.
 
+config BPF_TRANSPARENT_HUGEPAGE
+	bool "BPF-based transparent hugepage (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGEPAGE && CGROUP_BPF
+	help
+	  Using cgroup-bpf to customize mTHP size for different scenarios,
+	  automatically select different mTHP sizes for different cgroups,
+	  let's focus on making them truly transparent.
+
+	  This is an experimental feature, that might go away at any time,
+	  Please do not rely any production environment.
+
+	  EXPERIMENTAL because the BPF interface is unstable and may be removed
+	  at any time.
+
 endif # TRANSPARENT_HUGEPAGE
 
 # simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..b474c21c3253 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
 ifdef CONFIG_BPF_SYSCALL
 obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
+obj-$(CONFIG_BPF_TRANSPARENT_HUGEPAGE) += bpf_huge_memory.o
 endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
new file mode 100644
index 000000000000..851c6ebe2933
--- /dev/null
+++ b/mm/bpf_huge_memory.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Huge memory related BPF code
+ *
+ * Author: Vernon Yang <yanglincheng@kylinos.cn>
+ */
+
+#include <linux/bpf.h>
+#include <linux/srcu.h>
+
+/* Protects cgrp->mthp_ops pointer for read and write. */
+DEFINE_SRCU(mthp_bpf_srcu);
+
+unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
+{
+	struct cgroup *cgrp;
+	struct mem_cgroup *memcg;
+	struct bpf_mthp_ops *ops;
+	int idx;
+
+	memcg = get_mem_cgroup_from_mm(mm);
+	if (!memcg)
+		return orders;
+
+	cgrp = memcg->css.cgroup;
+
+	idx = srcu_read_lock(&mthp_bpf_srcu);
+	ops = READ_ONCE(cgrp->mthp_ops);
+	if (unlikely(ops && ops->mthp_choose))
+		orders = ops->mthp_choose(cgrp, orders);
+	srcu_read_unlock(&mthp_bpf_srcu, idx);
+
+	mem_cgroup_put(memcg);
+
+	return orders;
+}
+
+static int bpf_mthp_ops_btf_struct_access(struct bpf_verifier_log *log,
+		const struct bpf_reg_state *reg, int off, int size)
+{
+	return -EACCES;
+}
+
+static bool bpf_mthp_ops_is_valid_access(int off, int size, enum bpf_access_type type,
+		const struct bpf_prog *prog, struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+const struct bpf_verifier_ops bpf_mthp_verifier_ops = {
+	.get_func_proto = bpf_base_func_proto,
+	.btf_struct_access = bpf_mthp_ops_btf_struct_access,
+	.is_valid_access = bpf_mthp_ops_is_valid_access,
+};
+
+static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct bpf_mthp_ops *ops = kdata;
+	struct cgroup_subsys_state *child;
+	struct cgroup *cgrp;
+
+	if (!link)
+		return -EOPNOTSUPP;
+
+	cgrp = st_link->cgroup;
+	if (!cgrp)
+		return -EINVAL;
+
+	cgroup_lock();
+	css_for_each_descendant_pre(child, &cgrp->self) {
+		if (READ_ONCE(child->cgroup->mthp_ops)) {
+			pr_warn("sub-cgroup has already registered.\n");
+			cgroup_unlock();
+			return -EBUSY;
+		}
+	}
+	css_for_each_descendant_pre(child, &cgrp->self)
+		WRITE_ONCE(child->cgroup->mthp_ops, ops);
+	cgroup_unlock();
+
+	return 0;
+}
+
+static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct cgroup_subsys_state *child;
+	struct cgroup *cgrp;
+
+	if (!link)
+		return;
+
+	cgrp = st_link->cgroup;
+	if (!cgrp)
+		return;
+
+	cgroup_lock();
+	css_for_each_descendant_pre(child, &cgrp->self)
+		WRITE_ONCE(child->cgroup->mthp_ops, NULL);
+	cgroup_unlock();
+
+	synchronize_srcu(&mthp_bpf_srcu);
+}
+
+static int bpf_mthp_ops_check_member(const struct btf_type *t,
+				     const struct btf_member *member,
+				     const struct bpf_prog *prog)
+{
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_mthp_ops, mthp_choose):
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (prog->sleepable)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bpf_mthp_ops_init_member(const struct btf_type *t,
+				    const struct btf_member *member,
+				    void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_mthp_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static unsigned long cfi_mthp_choose(struct cgroup *cgrp, unsigned long orders)
+{
+	return 0;
+}
+
+static struct bpf_mthp_ops cfi_bpf_mthp_ops = {
+	.mthp_choose = cfi_mthp_choose,
+};
+
+static struct bpf_struct_ops bso_bpf_mthp_ops = {
+	.verifier_ops = &bpf_mthp_verifier_ops,
+	.reg = bpf_mthp_ops_reg,
+	.unreg = bpf_mthp_ops_unreg,
+	.check_member = bpf_mthp_ops_check_member,
+	.init_member = bpf_mthp_ops_init_member,
+	.init = bpf_mthp_ops_init,
+	.name = "bpf_mthp_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &cfi_bpf_mthp_ops,
+};
+
+static int __init bpf_huge_memory_init(void)
+{
+	int err;
+
+	err = register_bpf_struct_ops(&bso_bpf_mthp_ops, bpf_mthp_ops);
+	if (err)
+		pr_warn("Registration of bpf_mthp_ops failed, err %d\n", err);
+
+	return err;
+}
+late_initcall(bpf_huge_memory_init);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH v2 4/4] samples: bpf: add mthp_ext
  2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
                   ` (2 preceding siblings ...)
  2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
@ 2026-05-08 15:00 ` Vernon Yang
  2026-05-08 15:40   ` bot+bpf-ci
  2026-05-08 15:14 ` [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Lorenzo Stoakes
  2026-05-08 16:00 ` Pedro Falcato
  5 siblings, 1 reply; 16+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
  To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, Vernon Yang

From: Vernon Yang <yanglincheng@kylinos.cn>

Design mthp_ext case to address real workload issues.

The main functions of the mthp_ext are as follows:

- When sub-cgroup is under high memory pressure (default, full 100ms 1s),
  it will automatically fallback to using 4KB.
- When the anon+shmem memory usage of sub-cgroup falls below the minimum
  memory (default 16MB), small-memory processes will automatically
  fallback to using 4KB.
- Under normal conditions, when there is no memory pressure and the
  anon+shmem memory usage exceeds the minimum memory, all mTHP sizes
  shall be utilized by kernel.
- Monitor the root-cgroup (/sys/fs/cgroup) directory by default, with
  support for specifying any cgroup directory.

Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
 samples/bpf/.gitignore     |   1 +
 samples/bpf/Makefile       |   7 +-
 samples/bpf/mthp_ext.bpf.c | 148 ++++++++++++++++
 samples/bpf/mthp_ext.c     | 339 +++++++++++++++++++++++++++++++++++++
 samples/bpf/mthp_ext.h     |  30 ++++
 5 files changed, 524 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/mthp_ext.bpf.c
 create mode 100644 samples/bpf/mthp_ext.c
 create mode 100644 samples/bpf/mthp_ext.h

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0002cd359fb1..2a73581876b4 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -49,3 +49,4 @@ iperf.*
 /vmlinux.h
 /bpftool/
 /libbpf/
+mthp_ext
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95a4fa1f1e44..357c7d1c45ef 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -37,6 +37,7 @@ tprogs-y += xdp_fwd
 tprogs-y += task_fd_query
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += mthp_ext
 
 # Libbpf dependencies
 LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -122,6 +123,7 @@ always-y += task_fd_query_kern.o
 always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
+always-y += mthp_ext.bpf.o
 
 COMMON_CFLAGS = $(TPROGS_USER_CFLAGS)
 TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -289,6 +291,8 @@ $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
 $(obj)/hbm.o: $(src)/hbm.h
 $(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
 
+mthp_ext: $(obj)/mthp_ext.skel.h
+
 # Override includes for xdp_sample_user.o because $(srctree)/usr/include in
 # TPROGS_CFLAGS causes conflicts
 XDP_SAMPLE_CFLAGS += -Wall -O2 \
@@ -347,10 +351,11 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
 		-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
 		-c $(filter %.bpf.c,$^) -o $@
 
-LINKED_SKELS := xdp_router_ipv4.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h mthp_ext.skel.h
 clean-files += $(LINKED_SKELS)
 
 xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
+mthp_ext.skel.h-deps := mthp_ext.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c
new file mode 100644
index 000000000000..3524dc45fda4
--- /dev/null
+++ b/samples/bpf/mthp_ext.bpf.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include "mthp_ext.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include <vdso/bits.h>
+
+struct mem_info {
+	unsigned long long stall;
+	unsigned int order;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct mem_info);
+} cgrp_storage SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 256 * 1024);
+} events SEC(".maps");
+
+struct config_local configs;
+
+/*
+ * mthp_choose_impl - Choose the custom mTHP orders, read order from cgrp_storage,
+ *		      which is Adjustment by the cgroup_scan().
+ * @cgrp: control group
+ * @orders: original orders
+ *
+ * Return suited mTHP orders.
+ */
+SEC("struct_ops/mthp_choose")
+unsigned long BPF_PROG(mthp_choose_impl, struct cgroup *cgrp, unsigned long orders)
+{
+	struct mem_info *info;
+	unsigned int order;
+
+	if (configs.fixed) {
+		order = configs.init_order;
+		goto out;
+	}
+
+	info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0, 0);
+	if (!info)
+		return orders;
+
+	order = info->order;
+out:
+	if (!order)
+		return 0;
+
+	orders &= BIT(order + 1) - 1;
+	return orders;
+}
+
+SEC(".struct_ops.link")
+struct bpf_mthp_ops mthp_ops = {
+	.mthp_choose = (void *)mthp_choose_impl,
+};
+
+/* backport from kernel/cgroup/cgroup.c */
+static bool cgroup_has_tasks(struct cgroup *cgrp)
+{
+	return cgrp->nr_populated_csets;
+}
+
+/*
+ * cgroup_scan - scan all descendant cgroups under root cgroup.
+ *
+ * 1. When the memory usage of the sub-cgroup falls below the <min> threshold,
+ *    it will automatically fall back to using 4KB size; otherwise, it will
+ *    use all mTHP sizes.
+ * 2. When memory.pressure stall time of the sub-cgroup exceeds <threshold>,
+ *    it will automatically fall back to using 4KB size; otherwise, it will
+ *    use all mTHP sizes.
+ *
+ * Return 1 indicates termination of the iteration loop, and return 0 indicates
+ * iteration to the next sub-cgroup.
+ */
+SEC("iter.s/cgroup")
+int cgroup_scan(struct bpf_iter__cgroup *ctx)
+{
+	struct cgroup *cgrp = ctx->cgroup;
+	struct mem_cgroup *memcg;
+	struct mem_info *info;
+	struct alert_event *e;
+	unsigned long curr_mem;
+	unsigned long long curr_stall;
+	unsigned long long delta;
+
+	if (!cgrp)
+		return 1;
+
+	if (!cgroup_has_tasks(cgrp))
+		return 0;
+
+	info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0,
+				    BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!info)
+		return 0;
+
+	memcg = bpf_get_mem_cgroup(&cgrp->self);
+	if (!memcg)
+		return 0;
+
+	bpf_cgroup_flush_stats(cgrp);
+	curr_stall = bpf_cgroup_stall(cgrp, PSI_MEM_FULL);
+	if (!info->stall) {
+		info->order = configs.init_order;
+		goto UPDATE;
+	}
+	delta = curr_stall - info->stall;
+	bpf_mem_cgroup_flush_stats(memcg);
+	curr_mem = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED) +
+		   bpf_mem_cgroup_page_state(memcg, NR_SHMEM);
+	if ((curr_mem && curr_mem < FROM_MB(configs.min_mem)) ||
+	     delta >= configs.threshold)
+		info->order = 0;
+	else
+		info->order = PMD_ORDER;
+
+	if (configs.debug) {
+		e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
+		if (e) {
+			e->prev_stall = info->stall;
+			e->curr_stall = curr_stall;
+			e->delta = delta;
+			e->mem = curr_mem;
+			e->order = info->order;
+			bpf_probe_read_kernel_str(e->name, sizeof(e->name),
+						  cgrp->kn->name);
+			bpf_ringbuf_submit(e, 0);
+		}
+	}
+
+UPDATE:
+	info->stall = curr_stall;
+	bpf_put_mem_cgroup(memcg);
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c
new file mode 100644
index 000000000000..120c331ff26a
--- /dev/null
+++ b/samples/bpf/mthp_ext.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <signal.h>
+#include <time.h>
+#include <stdbool.h>
+#include <getopt.h>
+#include <sys/epoll.h>
+#include <sys/stat.h>
+#include <linux/limits.h>
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include "mthp_ext.h"
+#include "mthp_ext.skel.h"
+
+#define DEFAULT_ROOT		"/sys/fs/cgroup"
+#define DEFAULT_THRESHOLD_MS	100UL
+#define DEFAULT_INTERVAL_MS	1000UL
+#define DEFAULT_ORDER		PMD_ORDER
+#define DEFAULT_MIN_MEM		16
+
+static bool exiting;
+
+static void usage(const char *name)
+{
+	fprintf(stderr,
+		"Usage: %s [OPTIONS]\n\n"
+		"Monitor specified cgroup, adjust mTHP size via cgroup_bpf.\n\n"
+		"Currently supports fixed mTHP size and automatic mTHP size adjustment.\n"
+		"By default, it monitors the entire cgroup and automatically\n"
+		"adjusts mTHP size within the specified time window <interval>.\n"
+		"1. When the memory size of the sub-cgroup falls below\n"
+		"   the <min> threshold, it will automatically fall back to\n"
+		"   using 4KB size; otherwise, it will use all mTHP sizes.\n"
+		"2. When memory.pressure stall time of the sub-cgroup exceeds\n"
+		"   <threshold>, it will automatically fall back to using 4KB\n"
+		"   size; otherwise, it will use all mTHP sizes.\n\n"
+		"Options:\n"
+		"  -r, --root=PATH        Root cgroup path (default: /sys/fs/cgroup)\n"
+		"  -t, --threshold=MS     threshold in ms (default: %lu)\n"
+		"  -i, --interval=MS      interval in ms (default: %lu)\n"
+		"  -o, --order=NR         Initial mthp order (default: %d)\n"
+		"  -m, --min=MB           Minimum memory size for mTHP (default: %d)\n"
+		"  -f, --fixed            Use fixed order, disable auto-adjustment\n"
+		"  -d, --debug            Enable debug output\n"
+		"  -h, --help             Show this help\n",
+		name, DEFAULT_THRESHOLD_MS, DEFAULT_INTERVAL_MS, DEFAULT_ORDER,
+		DEFAULT_MIN_MEM);
+}
+
+static void sig_handler(int sig)
+{
+	exiting = true;
+}
+
+static int setup_psi_trigger(const char *cgroup_path, const char *type,
+			     unsigned long stall_us, unsigned long window_us)
+{
+	char path[PATH_MAX];
+	char trigger[128];
+	int fd, nr;
+
+	snprintf(path, sizeof(path), "%s/memory.pressure", cgroup_path);
+	fd = open(path, O_RDWR | O_NONBLOCK);
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: open PSI file failed\n");
+		return -errno;
+	}
+
+	nr = snprintf(trigger, sizeof(trigger), "%s %lu %lu",
+		      type, stall_us, window_us);
+	if (write(fd, trigger, nr) < 0) {
+		fprintf(stderr, "ERROR: write PSI trigger failed\n");
+		close(fd);
+		return -errno;
+	}
+
+	return fd;
+}
+
+static int trigger_scan(struct bpf_link *iter_link)
+{
+	char buf[256];
+	int fd;
+
+	fd = bpf_iter_create(bpf_link__fd(iter_link));
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: bpf_iter_create failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+
+	/* Read to trigger the iter program execution */
+	while (read(fd, buf, sizeof(buf)) > 0)
+		;
+
+	close(fd);
+	return 0;
+}
+
+static void *monitor_thread(int psi_fd, struct config_local *configs,
+		struct bpf_link *iter_link, struct ring_buffer *rb)
+{
+	struct epoll_event e;
+	int epoll_fd;
+	int nfds;
+
+	epoll_fd = epoll_create1(0);
+	if (epoll_fd < 0) {
+		fprintf(stderr, "ERROR: epoll_create1 failed\n");
+		return NULL;
+	}
+
+	e.events = EPOLLPRI;
+	e.data.fd = psi_fd;
+	if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, psi_fd, &e)) {
+		fprintf(stderr, "ERROR: epoll_ctl failed\n");
+		goto CLOSE;
+	}
+
+	/* First initialization */
+	trigger_scan(iter_link);
+
+	/* Auto adjustment */
+	while (!exiting) {
+		nfds = epoll_wait(epoll_fd, &e, 1, configs->interval * 2);
+		trigger_scan(iter_link);
+
+		if (configs->debug) {
+			printf("PSI: memory pressure %s\n", nfds ? "high" : "low");
+			ring_buffer__poll(rb, 0);
+		}
+	}
+
+CLOSE:
+	close(epoll_fd);
+	return NULL;
+}
+
+static int handle_event(void *ctx, void *data, size_t len)
+{
+	struct alert_event *e = data;
+
+	printf("cgroup %s: stall %llu -> %llu (+%llu), mem %luMB, mthp order=%d\n",
+		e->name[0] ? e->name : "/",
+		e->prev_stall, e->curr_stall, e->delta, TO_MB(e->mem), e->order);
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	const char *root_path = DEFAULT_ROOT;
+	unsigned long threshold = DEFAULT_THRESHOLD_MS;
+	unsigned long interval = DEFAULT_INTERVAL_MS;
+	unsigned int init_order = DEFAULT_ORDER;
+	unsigned int min_mem = DEFAULT_MIN_MEM;
+	bool fixed = false;
+	bool debug = false;
+	struct mthp_ext *skel;
+	struct bpf_link *iter_link;
+	struct bpf_link *ops_link;
+	struct ring_buffer *rb;
+	int root_fd;
+	int psi_fd;
+	int err = 0;
+	int opt;
+
+	static struct option long_options[] = {
+		{"root",       required_argument, 0, 'r'},
+		{"threshold",  required_argument, 0, 't'},
+		{"interval",   required_argument, 0, 'i'},
+		{"order",      required_argument, 0, 'o'},
+		{"min",        required_argument, 0, 'm'},
+		{"fixed",      no_argument,       0, 'f'},
+		{"debug",      no_argument,       0, 'd'},
+		{"help",       no_argument,       0, 'h'},
+		{0, 0, 0, 0}
+	};
+
+	while ((opt = getopt_long(argc, argv, "r:t:i:o:m:fdh",
+				  long_options, NULL)) != -1) {
+		switch (opt) {
+		case 'r':
+			root_path = optarg;
+			break;
+		case 't':
+			threshold = strtoul(optarg, NULL, 10);
+			break;
+		case 'i':
+			interval = strtoul(optarg, NULL, 10);
+			break;
+		case 'o':
+			init_order = min(strtoul(optarg, NULL, 10), PMD_ORDER);
+			break;
+		case 'm':
+			min_mem = strtoul(optarg, NULL, 10);
+			break;
+		case 'f':
+			fixed = true;
+			break;
+		case 'd':
+			debug = true;
+			break;
+		case 'h':
+			usage(argv[0]);
+			return 0;
+		default:
+			usage(argv[0]);
+			return -EINVAL;
+		}
+	}
+
+	if (!threshold || !interval) {
+		fprintf(stderr, "ERROR: threshold and interval must be > 0\n");
+		usage(argv[0]);
+		return -EINVAL;
+	}
+
+	signal(SIGINT, sig_handler);
+	signal(SIGTERM, sig_handler);
+
+	root_fd = open(root_path, O_RDONLY);
+	if (root_fd < 0) {
+		fprintf(stderr, "ERROR: open '%s' failed: %s\n",
+			root_path, strerror(errno));
+		return -errno;
+	}
+
+	skel = mthp_ext__open();
+	if (!skel) {
+		fprintf(stderr, "ERROR: failed to open BPF skeleton\n");
+		err = -ENOMEM;
+		goto open_skel_fail;
+	}
+
+	skel->bss->configs.threshold = threshold;
+	skel->bss->configs.interval = interval;
+	skel->bss->configs.init_order = init_order;
+	skel->bss->configs.min_mem = min_mem;
+	skel->bss->configs.fixed = fixed;
+	skel->bss->configs.debug = debug;
+
+	err = mthp_ext__load(skel);
+	if (err) {
+		fprintf(stderr, "ERROR: failed to load BPF program: %d\n", err);
+		goto load_skel_fail;
+	}
+
+	/* Attach struct_ops to root cgroup for mthp_choose */
+	DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+	opts.flags = BPF_F_CGROUP_FD;
+	opts.target_fd = root_fd;
+	ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts);
+	err = libbpf_get_error(ops_link);
+	if (err) {
+		fprintf(stderr, "ERROR: attach struct_ops failed: %d\n", err);
+		ops_link = NULL;
+		goto attach_opts_fail;
+	}
+
+	printf("Monitoring         : %s\n"
+	       "threshold          : %lums\n"
+	       "Interval           : %lums\n"
+	       "Initial order      : %d%s\n"
+	       "min memory         : %dMB\n"
+	       "Debug              : %s\n"
+	       "Press Ctrl+C to exit.\n\n",
+	       root_path, threshold, interval, init_order,
+	       fixed ? " (fixed)" : " (auto)", min_mem,
+	       debug ? "on" : "off");
+
+	if (fixed) {
+		while (!exiting)
+			usleep(interval * 1000);
+		goto exit_fixed;
+	}
+
+	/* Auto adjustment, attach cgroup iter for scanning root + descendants */
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, iter_opts);
+	union bpf_iter_link_info linfo = {
+		.cgroup.cgroup_fd = root_fd,
+		.cgroup.order = BPF_CGROUP_ITER_DESCENDANTS_PRE,
+	};
+	iter_opts.link_info = &linfo;
+	iter_opts.link_info_len = sizeof(linfo);
+	iter_link = bpf_program__attach_iter(skel->progs.cgroup_scan, &iter_opts);
+	err = libbpf_get_error(iter_link);
+	if (err) {
+		fprintf(stderr, "ERROR: attach cgroup iter failed: %d\n", err);
+		iter_link = NULL;
+		goto attach_iter_fail;
+	}
+
+	/* Set up ring buffer for receiving alerts */
+	rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
+			      handle_event, NULL, NULL);
+	if (!rb) {
+		fprintf(stderr, "ERROR: failed to create ring buffer\n");
+		err = -ENOMEM;
+		goto rb_fail;
+	}
+
+
+	psi_fd = setup_psi_trigger(root_path, "some", threshold * 1000,
+				   interval * 1000);
+	if (psi_fd < 0) {
+		fprintf(stderr, "ERROR: PSI trigger setup failed\n");
+		err = -EINVAL;
+		goto psi_setup_fail;
+	}
+
+	monitor_thread(psi_fd, &skel->bss->configs, iter_link, rb);
+
+	close(psi_fd);
+psi_setup_fail:
+	ring_buffer__free(rb);
+rb_fail:
+	bpf_link__destroy(iter_link);
+exit_fixed:
+attach_iter_fail:
+	bpf_link__destroy(ops_link);
+attach_opts_fail:
+load_skel_fail:
+	mthp_ext__destroy(skel);
+open_skel_fail:
+	close(root_fd);
+
+	printf("\nExiting...\n");
+
+	return err;
+}
diff --git a/samples/bpf/mthp_ext.h b/samples/bpf/mthp_ext.h
new file mode 100644
index 000000000000..e29d80aa15bf
--- /dev/null
+++ b/samples/bpf/mthp_ext.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __MTHP_EXT_H__
+#define __MTHP_EXT_H__
+
+#define CGROUP_NAME_LEN 128
+#define PMD_ORDER	9
+#define min(a, b)	((a) < (b) ? a : b)
+#define FROM_MB(s)	(s * 1024UL * 1024UL)
+#define TO_MB(s)	(s / 1024UL / 1024UL)
+
+struct config_local {
+	unsigned long threshold;
+	unsigned long interval;
+	unsigned int  init_order;
+	unsigned int  min_mem;
+	bool fixed;
+	bool debug;
+};
+
+struct alert_event {
+	unsigned long long prev_stall;
+	unsigned long long curr_stall;
+	unsigned long long delta;
+	unsigned long mem;
+	unsigned int  order;
+	char name[CGROUP_NAME_LEN];
+};
+
+#endif /* __MTHP_EXT_H__ */
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
  2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
                   ` (3 preceding siblings ...)
  2026-05-08 15:00 ` [PATCH v2 4/4] samples: bpf: add mthp_ext Vernon Yang
@ 2026-05-08 15:14 ` Lorenzo Stoakes
  2026-05-08 16:05   ` Lorenzo Stoakes
  2026-05-08 16:00 ` Pedro Falcato
  5 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 15:14 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
	gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang

Thanks for the series, but overall it's got to be no to this until THP and mTHP
are in more stable shape.

And this is an RFC, you're trying to make really fundamental changes here, it's
almost... rude to do that out of the blue non-RFC'd (unless you're a maintainer
perhaps).

Right now the THP code base is a total mess and mTHP support is not even
properly merged yet (khugepaged support outstanding).

BPF interfaces are permanent, we've tried the 'experimental' thing before, it
doesn't work and we'll not be able to yank it later.

I've said it before, but we really truly need to get THP into better shape
before we can tolerate large new changes, let alone an user-exported interface.

So can we defer this until we're in better shape, and then send that as an RFC
first please?

On Fri, May 08, 2026 at 11:00:51PM +0800, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Hi all,
>
> Background
> ==========
>
> As is well known, a system can simultaneously run multiple different
> scenarios. However, THP is not beneficial in every scenario — it is only
> most suitable for memory-intensive applications that are not sensitive
> to tail latency. For example, Redis, which is sensitive to tail latency,
> is not suitable for THP. But in practice, due to Redis issues, the
> entire THP functionality is often turned off, preventing other scenarios
> from benefiting from it.
>
> There are also some embedded scenarios (e.g. Android) that directly use
> 2MB THP, where the granularity is too large. Therefore, we introduced
> mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> still globally fix a single mTHP size and are unable to automatically
> select different mTHP sizes based on different scenarios.
>
> After testing, it was found that
>
> - When the system has a lot of free memory, it is normal for Redis to
>   use mTHP. performance degradation in Redis only occurs when the system
>   is under high memory pressure.
> - Additionally, when a large number of small-memory processes use mTHP,
>   memory waste is prone to occur, and performance degradation may also
>   happen during fast memory allocation/release.
>
> Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> following issues.
>
> - It breaks the cgroup hierarchy property.
> - Add new THP knobs, making sysadmin's job more complex
>
> Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> following issues.
>
> - It didn't address the issue on the per-process mode.
> - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
>   the same objective, there is no need to add two mechanisms for the
>   same purpose.
> - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
>   faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
>   cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
>   implementation.
> - Unclear ABI stability guarantees.

Not unclear, any BPF interface is permament.

> - The test cases are too simplistic, lacking eBPF cases similar to real
>   workloads such as sched_ext.
>
> If I miss some thing, please let me know. Thanks!
>
> Solution
> ========
>
> This series will solve all the problems mentioned above.
>
> 1. Using cgroup-bpf to customize mTHP size for different scenarios
> 2. Use a cgroup eBPF program to monitor all sub-cgroups. Sub-cgroups
>    under the same parent-cgroup adopt the same eBPF program. Only multiple
>    sibling-cgroups (where the parent-cgroup has no attached eBPF program)
>    are supported to attach multiple different eBPF programs without
>    breaking the hierarchy property of the cgroup.
> 3. Automatically select different mTHP sizes for different cgroups,
>    let's focus on making them truly transparent.

I don't see how cgroup level control is transparent :) this overall seems like
THP control at cgroup level by the back door, and I thought the cgroup people
were adamently against that.

Personally I think we should actually allow less 'transparent' THP but that's a
debatable subject obviously.

> 4. Design mthp_ext case to address real workload issues and further
>    clear/stabilize the ABI.
>
> The main functions of the mthp_ext are as follows:
>
> - When sub-cgroup is under high memory pressure (default, full 100ms 1s),
>   it will automatically fallback to using 4KB.
> - When the anon+shmem memory usage of sub-cgroup falls below the minimum
>   memory (default 16MB), small-memory processes will automatically
>   fallback to using 4KB.
> - Under normal conditions, when there is no memory pressure and the
>   anon+shmem memory usage exceeds the minimum memory, all mTHP sizes
>   shall be utilized by kernel.
> - Monitor the root-cgroup (/sys/fs/cgroup) directory by default, with
>   support for specifying any cgroup directory.

This seems like something prescriptive rather than 'bpf lets you make a
decision' and cgroup-level THP behaviour changes? It seems really out of scope.

>
> Performance
> ===========
>
> The below is some performance test results, testing on x86_64 machine
> (AMD Ryzen9 9950X 16C32T, 32G memory, 8G zram).
>
> NOTE: The following always/never labels indicate setting all mTHP sizes
> to always/never. Detailed test script reference[4].
>
> redis results
> ~~~~~~~~~~~~~
>
> command: redis-benchmark --csv -r 3000000 -n 3000000 -d 1024 -c 16 -P 32 -t set
>
> When cgroup memory.high=max, no memory pressure, seems only noise level
> changes, mthp_ext no regression.
>
> | redis-noBGSAVE | always      | never                | always+mthp_ext     |
> |----------------|-------------|----------------------|---------------------|
> | rps            | 1431307.083 | 1224004.250 (-14.5%) | 1420053.873 (-0.8%) |
> | avg_latency_ms | 0.216       | 0.256       (-18.5%) | 0.218       (-0.9%) |
> | p95_latency_ms | 0.612       | 0.708       (-15.7%) | 0.615       (-0.5%) |
> | p99_latency_ms | 0.682       | 0.812       (-19.1%) | 0.692       (-1.5%) |
>
> | redis-BGSAVE   | always      | never                | always+mthp_ext    |
> |----------------|-------------|----------------------|--------------------|
> | rps            | 1429093.707 | 1231569.587 (-13.8%) | 1431075.330 (0.1%) |
> | avg_latency_ms | 0.216       | 0.255       (-18.1%) | 0.216       (0.0%) |
> | p95_latency_ms | 0.618       | 0.706       (-14.2%) | 0.615       (0.5%) |
> | p99_latency_ms | 0.684       | 0.823       (-20.3%) | 0.684       (0.0%) |
>
> When cgroup memory.high=2G, high memory pressure, mthp_ext RPS improve by
> 3450%, while significantly reducing the tail latency by 99%.
>
> | redis-noBGSAVE | always    | never                | always+mthp_ext      |
> |----------------|-----------|----------------------|----------------------|
> | rps            | 24932.790 | 976610.893 (3817.0%) | 885337.250 (3450.9%) |
> | avg_latency_ms | 13.173    | 0.326        (97.5%) | 0.367        (97.2%) |
> | p95_latency_ms | 23.028    | 0.786        (96.6%) | 1.511        (93.4%) |
> | p99_latency_ms | 366.762   | 1.183        (99.7%) | 2.975        (99.2%) |
>
> | redis-BGSAVE   | always    | never                 | always+mthp_ext      |
> |----------------|-----------|-----------------------|----------------------|
> | rps            | 50551.567 | 1026720.293 (1931.0%) | 892643.707 (1665.8%) |
> | avg_latency_ms | 6.581     | 0.310         (95.3%) | 0.365        (94.5%) |
> | p95_latency_ms | 16.730    | 0.772         (95.4%) | 1.447        (91.4%) |
> | p99_latency_ms | 311.551   | 1.140         (99.6%) | 2.988        (99.0%) |
>
> unixbench results
> ~~~~~~~~~~~~~~~~~
>
> command: ./Run -c 1 shell8
>
> mthp_ext improved by 5.99%.
>
> | unixbench shell8 | always  | never           | always+mthp_ext |
> |------------------|---------|-----------------|-----------------|
> | Score            | 22916.8 | 24304.0 (6.05%) | 24289.9 (5.99%) |
>
> kernbench results
> ~~~~~~~~~~~~~~~~~
>
> When cgroup memory.high=max, no memory pressure, seems only noise level
> changes, mthp_ext no regression.
>
>                             always                 never               always+mthp_ext
> Amean     user-32    19702.39 (   0.00%)    18428.90 *   6.46%*    19706.73 (  -0.02%)
> Amean     syst-32     1159.55 (   0.00%)     2252.43 * -94.25%*     1177.48 *  -1.55%*
> Amean     elsp-32      703.28 (   0.00%)      699.10 *   0.59%*      703.99 *  -0.10%*
> BAmean-95 user-32    19701.79 (   0.00%)    18425.01 (   6.48%)    19704.78 (  -0.02%)
> BAmean-95 syst-32     1159.43 (   0.00%)     2251.86 ( -94.22%)     1177.03 (  -1.52%)
> BAmean-95 elsp-32      703.24 (   0.00%)      698.99 (   0.61%)      703.88 (  -0.09%)
> BAmean-99 user-32    19701.79 (   0.00%)    18425.01 (   6.48%)    19704.78 (  -0.02%)
> BAmean-99 syst-32     1159.43 (   0.00%)     2251.86 ( -94.22%)     1177.03 (  -1.52%)
> BAmean-99 elsp-32      703.24 (   0.00%)      698.99 (   0.61%)      703.88 (  -0.09%)
>
> When cgroup memory.high=2G, high memory pressure, mthp_ext improved by 26%.
>
>                             always                 never               always+mthp_ext
> Amean     user-32    20250.65 (   0.00%)    18368.91 *   9.29%*    18681.27 *   7.75%*
> Amean     syst-32    12778.56 (   0.00%)     9636.99 *  24.58%*     9392.65 *  26.50%*
> Amean     elsp-32     1377.55 (   0.00%)     1026.10 *  25.51%*     1019.40 *  26.00%*
> BAmean-95 user-32    20233.75 (   0.00%)    18353.57 (   9.29%)    18678.01 (   7.69%)
> BAmean-95 syst-32    12543.21 (   0.00%)     9612.28 (  23.37%)     9386.83 (  25.16%)
> BAmean-95 elsp-32     1367.82 (   0.00%)     1023.75 (  25.15%)     1018.17 (  25.56%)
> BAmean-99 user-32    20233.75 (   0.00%)    18353.57 (   9.29%)    18678.01 (   7.69%)
> BAmean-99 syst-32    12543.21 (   0.00%)     9612.28 (  23.37%)     9386.83 (  25.16%)
> BAmean-99 elsp-32     1367.82 (   0.00%)     1023.75 (  25.15%)     1018.17 (  25.56%)
>
> TODO
> ====
>
> - mthp_ext handles different "enum tva_type" values. For example, for
>   small-memory processes, only 4KB is used in TVA_PAGEFAULT, while
>   TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continues to collapse all mthp
>   size. Under high memory pressure, only 4KB is used for
>   TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues to
>   collapse all mthp size.
> - selftest
>
> If there are additional scenarios, please let me know as well, so I can
> conduct further prototype verification tests to make mTHP more
> transparent and further clear/stabilize the BPF-THP ABI.
>
> If any of the above the strategies can be integrated into the kernel,
> please let me know. I would be delighted to incorporate these strategies
> into the kernel.
>
> This series is based on mm-new + "mm: BPF OOM"[3] first four patches.

Again, this really should have been an RFC, a 'TODO' section shouldn't exist in
a non-RFC series.

>
> Thank you very much for your comments and discussions.
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com
> [2] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com
> [3] https://lore.kernel.org/linux-mm/20260127024421.494929-1-roman.gushchin@linux.dev
> [4] https://github.com/vernon2gh/app_and_module/tree/main/mthp_ext
>
> V1 -> V2:
> - Rebase on mm-new, run all performance tests again.
> - Register eBPF programs only when no mthp_ops exists in all sub-cgroup, do not
>   destroy the cgroup hierarchy property.
> - Fix newly created cgroups silently bypass the hierarchical BPF mTHP policy.
> - Fix bpf_mthp_choose() UAF due to improper SRCU locking.
> - Add bounds check in bpf_cgroup_stall() and fix return type to u64.
> - Check cgroup_psi() return value.
> - Fix spurious mTHP fallback during initial cgroup scan due to zero-init
>   info->stall.
> - Fix info->order being set to 0 when no processes are running in the cgroup.
> - Fix Compilation fails when CONFIG_CGROUPS=y && CONFIG_PSI=n.
> - Fix NULL pointer dereference of st_link.
> - FIx infinite loop in trigger_scan() when read() returns an error.
> - Fix integer overflow in FROM_MB() macro.
> - Fix setup_psi_trigger() fail, but masks the error code.
>
> V1 : https://lore.kernel.org/linux-mm/20260503165024.1526680-1-vernon2gm@gmail.com/

All well and good, but I don't see any actual review there, another reason to
send this kind of thing as an RFC first please :)


>
> Vernon Yang (4):
>   psi: add psi_group_flush_stats() function
>   bpf: add bpf_cgroup_{flush_stats,stall} function
>   mm: introduce bpf_mthp_ops struct ops
>   samples: bpf: add mthp_ext
>
>  MAINTAINERS                     |   3 +
>  include/linux/bpf_huge_memory.h |  52 +++++
>  include/linux/cgroup-defs.h     |   1 +
>  include/linux/huge_mm.h         |   6 +
>  include/linux/psi.h             |   5 +
>  kernel/bpf/helpers.c            |  34 ++++
>  kernel/cgroup/cgroup.c          |   2 +
>  kernel/sched/psi.c              |  34 +++-
>  mm/Kconfig                      |  14 ++
>  mm/Makefile                     |   1 +
>  mm/bpf_huge_memory.c            | 168 ++++++++++++++++
>  samples/bpf/.gitignore          |   1 +
>  samples/bpf/Makefile            |   7 +-
>  samples/bpf/mthp_ext.bpf.c      | 148 ++++++++++++++
>  samples/bpf/mthp_ext.c          | 339 ++++++++++++++++++++++++++++++++
>  samples/bpf/mthp_ext.h          |  30 +++
>  16 files changed, 836 insertions(+), 9 deletions(-)
>  create mode 100644 include/linux/bpf_huge_memory.h
>  create mode 100644 mm/bpf_huge_memory.c
>  create mode 100644 samples/bpf/mthp_ext.bpf.c
>  create mode 100644 samples/bpf/mthp_ext.c
>  create mode 100644 samples/bpf/mthp_ext.h
>
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/4] psi: add psi_group_flush_stats() function
  2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
@ 2026-05-08 15:19   ` Lorenzo Stoakes
  0 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 15:19 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
	gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang

On Fri, May 08, 2026 at 11:00:52PM +0800, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Add psi_group_flush_stats() function to prepare for the subsequent
> mthp_ext ebpf program.

This isn't a great commit message, you're just saying you're adding a function
then what you plan to use it for, not anything about the why of adding it.

>
> no function changes.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
>  include/linux/psi.h |  1 +
>  kernel/sched/psi.c  | 34 ++++++++++++++++++++++++++--------
>  2 files changed, 27 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index e0745873e3f2..7b4fd8190810 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -22,6 +22,7 @@ void psi_init(void);
>  void psi_memstall_enter(unsigned long *flags);
>  void psi_memstall_leave(unsigned long *flags);
>
> +void psi_group_flush_stats(struct psi_group *group);

Feels a bit iffy, exporting an internal management function?

>  int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
>  struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
>  				       enum psi_res res, struct file *file,
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index d9c9d9480a45..76ffad90b0b5 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1242,11 +1242,35 @@ void psi_cgroup_restart(struct psi_group *group)
>  }
>  #endif /* CONFIG_CGROUPS */
>
> +/*
> + * __psi_group_flush_stats - flush the total stall time of a psi group
> + * @group: psi group to flush
> + */
> +static void __psi_group_flush_stats(struct psi_group *group)
> +{
> +	u64 now;
> +
> +	/* Update averages before reporting them */
> +	mutex_lock(&group->avgs_lock);
> +	now = sched_clock();
> +	collect_percpu_times(group, PSI_AVGS, NULL);
> +	if (now >= group->avg_next_update)
> +		group->avg_next_update = update_averages(group, now);
> +	mutex_unlock(&group->avgs_lock);

If we do need to factor this out, maybe worth making the mutex lock/unlock a
guard(mutex)(&group->avgs_lock) instead?

> +}
> +
> +void psi_group_flush_stats(struct psi_group *group)
> +{
> +	if (static_branch_likely(&psi_disabled))
> +		return;

Is it actually likely if you're calling this function?

And the caller doesn't care even if PSI is disabled?

> +
> +	__psi_group_flush_stats(group);
> +}
> +
>  int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
>  {
>  	bool only_full = false;
>  	int full;
> -	u64 now;
>
>  	if (static_branch_likely(&psi_disabled))
>  		return -EOPNOTSUPP;
> @@ -1256,13 +1280,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
>  		return -EOPNOTSUPP;
>  #endif
>
> -	/* Update averages before reporting them */
> -	mutex_lock(&group->avgs_lock);
> -	now = sched_clock();
> -	collect_percpu_times(group, PSI_AVGS, NULL);
> -	if (now >= group->avg_next_update)
> -		group->avg_next_update = update_averages(group, now);
> -	mutex_unlock(&group->avgs_lock);
> +	__psi_group_flush_stats(group);
>
>  #ifdef CONFIG_IRQ_TIME_ACCOUNTING
>  	only_full = res == PSI_IRQ;
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function
  2026-05-08 15:00 ` [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
@ 2026-05-08 15:40   ` bot+bpf-ci
  0 siblings, 0 replies; 16+ messages in thread
From: bot+bpf-ci @ 2026-05-08 15:40 UTC (permalink / raw)
  To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel,
	shakeel.butt, ast, daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, yanglincheng, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3699 bytes --]

> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index 7b4fd8190810..243dcf97bea4 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -52,6 +52,10 @@ static inline void psi_memstall_enter(unsigned long *flags) {}
>  static inline void psi_memstall_leave(unsigned long *flags) {}
>
>  #ifdef CONFIG_CGROUPS
> +static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
> +{
> +	return NULL;
> +}
>  static inline int psi_cgroup_alloc(struct cgroup *cgrp)
>  {
>  	return 0;

[ ... ]

> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index baa12b24bb64..5562a6c9d2f9 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -29,6 +29,7 @@
>  #include <linux/task_work.h>
>  #include <linux/irq_work.h>
>  #include <linux/buildid.h>
> +#include <linux/psi.h>
>
>  #include "../../lib/kstrtox.h"
>

[ ... ]

> @@ -2881,6 +2882,37 @@ bpf_task_get_cgroup1(struct task_struct *task, int hierarchy_id)
>  		return NULL;
>  	return cgrp;
>  }
> +
> +/**
> + * bpf_cgroup_stall - acquire the total stall time of cgroup
> + * @cgrp: cgroup struct
> + * @states: psi states
> + *
> + * Return the total stall time.
> + */
> +__bpf_kfunc u64 bpf_cgroup_stall(struct cgroup *cgrp, enum psi_states states)
> +{
> +	struct psi_group *group = cgroup_psi(cgrp);
> +
> +	if (unlikely(!group || (u32)states >= NR_PSI_STATES - 1))
> +		return (u64)-1;
> +
> +	return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC);
> +}
> +
> +/**
> + * bpf_cgroup_flush_stats - Flush cgroup's statistics
> + * @cgrp: cgroup struct
> + */
> +__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp)
> +{
> +	struct psi_group *group = cgroup_psi(cgrp);
> +
> +	if (unlikely(!group))
> +		return;
> +
> +	psi_group_flush_stats(group);
> +}
>  #endif /* CONFIG_CGROUPS */

Will these functions build when CONFIG_BPF_SYSCALL=y, CONFIG_CGROUPS=y,
but CONFIG_PSI=n?

The functions reference several PSI-only symbols:

In bpf_cgroup_stall():
- enum psi_states (parameter type) only exists in include/linux/psi_types.h
  when CONFIG_PSI=y
- NR_PSI_STATES and PSI_AVGS are enum members only defined when CONFIG_PSI=y
- struct psi_group's total field doesn't exist when CONFIG_PSI=n (psi_types.h
  defines an empty struct in that case)

In bpf_cgroup_flush_stats():
- psi_group_flush_stats() is only declared in include/linux/psi.h when
  CONFIG_PSI=y with no stub for the !CONFIG_PSI case

kernel/bpf/helpers.o is compiled whenever CONFIG_BPF_SYSCALL=y. Since PSI
is an independent config option, the combination CONFIG_BPF_SYSCALL=y &&
CONFIG_CGROUPS=y && CONFIG_PSI=n is valid and would hit undefined symbols.

The cgroup_psi() stub added in include/linux/psi.h only handles the
!CONFIG_PSI case for cgroup_psi() itself, but doesn't solve the unresolved
PSI types and functions. Should these kfuncs be guarded with CONFIG_PSI,
similar to how kernel/cgroup/cgroup.c handles cgroup_pressure_show()?

>
>  /**
> @@ -4734,6 +4766,8 @@ BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
>  BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_cgroup_stall)
> +BTF_ID_FLAGS(func, bpf_cgroup_flush_stats, KF_SLEEPABLE)
>  #endif

These BTF_ID_FLAGS entries would also need CONFIG_PSI guards if the function
definitions are moved inside CONFIG_PSI.


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25563319416

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 4/4] samples: bpf: add mthp_ext
  2026-05-08 15:00 ` [PATCH v2 4/4] samples: bpf: add mthp_ext Vernon Yang
@ 2026-05-08 15:40   ` bot+bpf-ci
  0 siblings, 0 replies; 16+ messages in thread
From: bot+bpf-ci @ 2026-05-08 15:40 UTC (permalink / raw)
  To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel,
	shakeel.butt, ast, daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, yanglincheng, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2847 bytes --]

> diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
> index 0002cd359fb1..2a73581876b4 100644
> --- a/samples/bpf/.gitignore
> +++ b/samples/bpf/.gitignore

[ ... ]

> diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c
> new file mode 100644
> index 000000000000..120c331ff26a
> --- /dev/null
> +++ b/samples/bpf/mthp_ext.c

[ ... ]

> @@ -0,0 +1,339 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include <signal.h>
> +#include <time.h>
> +#include <stdbool.h>
> +#include <getopt.h>
> +#include <sys/epoll.h>
> +#include <sys/stat.h>
> +#include <linux/limits.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +#include "mthp_ext.h"
> +#include "mthp_ext.skel.h"

[ ... ]

> +int main(int argc, char **argv)
> +{
> +	const char *root_path = DEFAULT_ROOT;
> +	unsigned long threshold = DEFAULT_THRESHOLD_MS;
> +	unsigned long interval = DEFAULT_INTERVAL_MS;
> +	unsigned int init_order = DEFAULT_ORDER;
> +	unsigned int min_mem = DEFAULT_MIN_MEM;
> +	bool fixed = false;
> +	bool debug = false;
> +	struct mthp_ext *skel;
> +	struct bpf_link *iter_link;
> +	struct bpf_link *ops_link;
> +	struct ring_buffer *rb;
> +	int root_fd;
> +	int psi_fd;
> +	int err = 0;
> +	int opt;

[ ... ]

> +	/* Attach struct_ops to root cgroup for mthp_choose */
> +	DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
> +	opts.flags = BPF_F_CGROUP_FD;
> +	opts.target_fd = root_fd;
> +	ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts);
> +	err = libbpf_get_error(ops_link);

Will mthp_ext compile with these symbols?  bot+bpf-ci@kernel.org noted in
v1 that bpf_struct_ops_opts, BPF_F_CGROUP_FD, and
bpf_map__attach_struct_ops_opts() do not exist anywhere in the kernel
source tree:

https://lore.kernel.org/bpf/c13f99d42c434ec73ec1e8cf1141047146ab359e60e56e78e1c32554d6e01e9b@mail.kernel.org/

The struct bpf_struct_ops_opts type is not defined in
tools/lib/bpf/libbpf.h.

The BPF_F_CGROUP_FD flag does not appear in include/uapi/linux/bpf.h or
any other header.

The bpf_map__attach_struct_ops_opts() function is not declared in
libbpf.h or exported in tools/lib/bpf/libbpf.map.

You mentioned in your response that this depends on patches from the 'mm:
BPF OOM' series (PATCH#2), but this dependency was not addressed in v2 -
the same non-existent API calls remain in the code.

Should this series explicitly list the dependency on the 'mm: BPF OOM'
PATCH#2 in the cover letter?

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25563319416

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
  2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
@ 2026-05-08 15:40   ` bot+bpf-ci
  2026-05-08 15:57   ` Lorenzo Stoakes
  2026-05-08 20:54   ` David Hildenbrand (Arm)
  2 siblings, 0 replies; 16+ messages in thread
From: bot+bpf-ci @ 2026-05-08 15:40 UTC (permalink / raw)
  To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel,
	shakeel.butt, ast, daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, yanglincheng, ast, andrii, daniel,
	martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 4644 bytes --]

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 13f49378b157..cad3a87d9873 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS

[ ... ]

> diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
> new file mode 100644
> index 000000000000..ffda445c9572
> --- /dev/null
> +++ b/include/linux/bpf_huge_memory.h

[ ... ]

> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f42563739d2e..78854d0e06ab 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h

[ ... ]

> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2949e5acff35..80ec622213df 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h

[ ... ]

> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1f084ee71443..7db83ca81e6e 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c

[ ... ]

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0a43bb80df4f..f3309376bcc5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -962,6 +962,20 @@ config NO_PAGE_MAPCOUNT
>
>  	  EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config BPF_TRANSPARENT_HUGEPAGE
> +	bool "BPF-based transparent hugepage (EXPERIMENTAL)"
> +	depends on TRANSPARENT_HUGEPAGE && CGROUP_BPF
> +	help
> +	  Using cgroup-bpf to customize mTHP size for different scenarios,
> +	  automatically select different mTHP sizes for different cgroups,
> +	  let's focus on making them truly transparent.
> +
> +	  This is an experimental feature, that might go away at any time,
> +	  Please do not rely any production environment.
                                  ^^^

This isn't a bug, but there's a grammar issue in the Kconfig help text.
The sentence is missing a preposition and pronoun. It should probably be
"Please do not rely on this in any production environment."

> +
> +	  EXPERIMENTAL because the BPF interface is unstable and may be removed
> +	  at any time.
> +

[ ... ]

> diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
> new file mode 100644
> index 000000000000..851c6ebe2933
> --- /dev/null
> +++ b/mm/bpf_huge_memory.c
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Huge memory related BPF code
> + *
> + * Author: Vernon Yang <yanglincheng@kylinos.cn>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/srcu.h>

[ ... ]

> +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> +	struct bpf_mthp_ops *ops = kdata;
> +	struct cgroup_subsys_state *child;
> +	struct cgroup *cgrp;
> +
> +	if (!link)
> +		return -EOPNOTSUPP;
> +
> +	cgrp = st_link->cgroup;
            ^^^^^^^^^^^^^^^^^^^^

Does struct bpf_struct_ops_link have a cgroup field? Looking at
kernel/bpf/bpf_struct_ops.c, the structure is defined as:

    struct bpf_struct_ops_link {
        struct bpf_link link;
        struct bpf_map __rcu *map;
        wait_queue_head_t wait_hup;
    };

There is no cgroup member. The same issue occurs in bpf_mthp_ops_unreg()
where st_link->cgroup is accessed. This would cause a compilation error
when CONFIG_BPF_TRANSPARENT_HUGEPAGE is enabled.

Were you intending to extend struct bpf_struct_ops_link to support
cgroup-scoped struct_ops, or is there a different mechanism planned for
associating the link with a specific cgroup?

> +	if (!cgrp)
> +		return -EINVAL;
> +
> +	cgroup_lock();
> +	css_for_each_descendant_pre(child, &cgrp->self) {
> +		if (READ_ONCE(child->cgroup->mthp_ops)) {
> +			pr_warn("sub-cgroup has already registered.\n");
> +			cgroup_unlock();
> +			return -EBUSY;
> +		}
> +	}
> +	css_for_each_descendant_pre(child, &cgrp->self)
> +		WRITE_ONCE(child->cgroup->mthp_ops, ops);
> +	cgroup_unlock();
> +
> +	return 0;
> +}
> +
> +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> +	struct cgroup_subsys_state *child;
> +	struct cgroup *cgrp;
> +
> +	if (!link)
> +		return;
> +
> +	cgrp = st_link->cgroup;
            ^^^^^^^^^^^^^^^^^^^^

Same issue here - the cgroup field doesn't exist in the struct.

> +	if (!cgrp)
> +		return;
> +
> +	cgroup_lock();
> +	css_for_each_descendant_pre(child, &cgrp->self)
> +		WRITE_ONCE(child->cgroup->mthp_ops, NULL);
> +	cgroup_unlock();
> +
> +	synchronize_srcu(&mthp_bpf_srcu);
> +}

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25563319416

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
  2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
  2026-05-08 15:40   ` bot+bpf-ci
@ 2026-05-08 15:57   ` Lorenzo Stoakes
  2026-05-08 20:54   ` David Hildenbrand (Arm)
  2 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 15:57 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
	gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang

NACK

This patch not only overreaches by fundamentally impacting THP behaviour (which
has NOTHING to do with the subject line), but also, unbelievably, taking control
over this away from the THP maintainers, are you actually serious here?

On Fri, May 08, 2026 at 11:00:54PM +0800, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Introducing bpf_mthp_ops enables eBPF programs to register the
> mthp_choose callback function via cgroup-ebpf.
>
> Using cgroup-bpf to customize mTHP size for different scenarios，
> automatically select different mTHP sizes for different cgroups,
> let's focus on making them truly transparent.

Err, wait what? You're both adding a BPF hook and then adding a default policy
change that affects all cgroups anyway?

Or are you not and this message is just wrong (I don't really see how you're
'automatically' doing anything here).

And the commit message is 'add struct opts'?

>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>

Please have a bit of a think about how you're approaching this, wait until the
THP code has actually been reworked (you can contribute patches to speed that
up), before even thinking of sending something like this again, and then send it
as an RFC.

> ---
>  MAINTAINERS                     |   3 +
>  include/linux/bpf_huge_memory.h |  52 ++++++++++
>  include/linux/cgroup-defs.h     |   1 +
>  include/linux/huge_mm.h         |   6 ++
>  kernel/cgroup/cgroup.c          |   2 +
>  mm/Kconfig                      |  14 +++
>  mm/Makefile                     |   1 +
>  mm/bpf_huge_memory.c            | 168 ++++++++++++++++++++++++++++++++
>  8 files changed, 247 insertions(+)
>  create mode 100644 include/linux/bpf_huge_memory.h
>  create mode 100644 mm/bpf_huge_memory.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index caaa0d6e6056..f1113eaa1193 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4887,7 +4887,10 @@ M:	Shakeel Butt <shakeel.butt@linux.dev>
>  L:	bpf@vger.kernel.org
>  L:	linux-mm@kvack.org
>  S:	Maintained
> +F:	include/linux/bpf_huge_memory.h
> +F:	mm/bpf_huge_memory.c

Err what??

You're adding THP-specific behaviour to 'BPF [MEMORY MANAGEMENT EXTENSIONS]'?

I'm sorry but what on earth possessed you to do that?

>  F:	mm/bpf_memcontrol.c
> +F:	samples/bpf/mthp_ext.*
>
>  BPF [MISC]
>  L:	bpf@vger.kernel.org
> diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
> new file mode 100644
> index 000000000000..ffda445c9572
> --- /dev/null
> +++ b/include/linux/bpf_huge_memory.h
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_HUGE_MEMORY_H
> +#define __BPF_HUGE_MEMORY_H
> +
> +#include <linux/cgroup-defs.h>
> +
> +/**
> + * struct bpf_mthp_ops - BPF callbacks for mTHP operations
> + * @mthp_choose: Choose the custom mTHP orders
> + *
> + * This structure defines the interface for BPF programs to customize
> + * mTHP behavior through struct_ops programs.
> + */
> +struct bpf_mthp_ops {
> +	unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders);
> +};
> +
> +#ifdef CONFIG_BPF_TRANSPARENT_HUGEPAGE
> +/**
> + * bpf_mthp_choose - Choose the custom mTHP orders using bpf
> + * @mm: task mm_struct
> + * @orders: original orders
> + *
> + * Return suited mTHP orders.
> + */
> +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders);
> +
> +/**
> + * cgroup_bpf_set_mthp_ops - Set sub-cgroup mthp_ops to parent cgroup
> + * @cgrp: want to set mthp_ops of sub-cgroup
> + * @parent: parent cgroup
> + */
> +static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp,
> +					   struct cgroup *parent)
> +{
> +	WRITE_ONCE(cgrp->mthp_ops, parent->mthp_ops);
> +}
> +#else
> +static inline unsigned long bpf_mthp_choose(struct mm_struct *mm,
> +					    unsigned long orders)
> +{
> +	return orders;
> +}
> +static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp,
> +					   struct cgroup *parent)
> +{
> +}
> +#endif /* CONFIG_BPF_TRANSPARENT_HUGEPAGE */

These have the same interface flaws as the original THP BPF work. We don't know
whether we want BPF interfering on this decision and it impacts future
development on this.

> +
> +#endif /* __BPF_HUGE_MEMORY_H */
> +
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f42563739d2e..78854d0e06ab 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -628,6 +628,7 @@ struct cgroup {
>
>  #ifdef CONFIG_BPF_SYSCALL
>  	struct bpf_local_storage __rcu  *bpf_cgrp_storage;
> +	struct bpf_mthp_ops *mthp_ops;
>  #endif
>  #ifdef CONFIG_EXT_SUB_SCHED
>  	struct scx_sched __rcu *scx_sched;
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 127f9e1e7604..65da35fb0980 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -3,6 +3,7 @@
>  #define _LINUX_HUGE_MM_H
>
>  #include <linux/mm_types.h>
> +#include <linux/bpf_huge_memory.h>
>
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  #include <linux/kobject.h>
> @@ -296,6 +297,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  				       enum tva_type type,
>  				       unsigned long orders)
>  {
> +	/* The eBPF-specified orders overrides which order is selected. */
> +	orders &= bpf_mthp_choose(vma->vm_mm, orders);

OK so every single time we call thp_vma_allowable_orders() we take an SRCU lock,
even if there aren't any BPF hooks?...!

And guess what, nobody in THP can do a damn thing to change it since you took
control of that away from us.

No dude.

> +	if (!orders)
> +		return 0;
> +
>  	/*
>  	 * Optimization to check if required orders are enabled early. Only
>  	 * forced collapse ignores sysfs configs.
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 43adc96c7f1a..1dbef3e8b179 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5836,6 +5836,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
>  	if (ret)
>  		goto out_stat_exit;
>
> +	cgroup_bpf_set_mthp_ops(cgrp, parent);
> +

I'm not loving putting this is a fundamental cgroup function like this.

>  	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
>  		cgrp->ancestors[tcgrp->level] = tcgrp;
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 27dc5b0139ba..be49bde783a7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -949,6 +949,20 @@ config NO_PAGE_MAPCOUNT
>
>  	  EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config BPF_TRANSPARENT_HUGEPAGE
> +	bool "BPF-based transparent hugepage (EXPERIMENTAL)"

Experimental means nothing.

> +	depends on TRANSPARENT_HUGEPAGE && CGROUP_BPF
> +	help
> +	  Using cgroup-bpf to customize mTHP size for different scenarios,
> +	  automatically select different mTHP sizes for different cgroups,
> +	  let's focus on making them truly transparent.
> +
> +	  This is an experimental feature, that might go away at any time,
> +	  Please do not rely any production environment.

That's not how BPF works.

> +
> +	  EXPERIMENTAL because the BPF interface is unstable and may be removed
> +	  at any time.

That's not how BPF works.

Did you even follow what was said on the last THP BPF series?

The interface is permanent, it doesn't matter what experimental labels you put
on it.

> +
>  endif # TRANSPARENT_HUGEPAGE
>
>  # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index 8ad2ab08244e..b474c21c3253 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -108,6 +108,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
>  endif
>  ifdef CONFIG_BPF_SYSCALL
>  obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
> +obj-$(CONFIG_BPF_TRANSPARENT_HUGEPAGE) += bpf_huge_memory.o
>  endif
>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>  obj-$(CONFIG_GUP_TEST) += gup_test.o
> diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
> new file mode 100644
> index 000000000000..851c6ebe2933
> --- /dev/null
> +++ b/mm/bpf_huge_memory.c
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Huge memory related BPF code

Honestly reading this is making me a bit... annoyed :)

You seem to be trying to take control of THP away from the THP maintainers and
reviewers who work bloody hard for the community.

I'm sure you don't mean to, but it's not at all welcome!

I'll stop here, this series is a no.

> + *
> + * Author: Vernon Yang <yanglincheng@kylinos.cn>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/srcu.h>
> +
> +/* Protects cgrp->mthp_ops pointer for read and write. */
> +DEFINE_SRCU(mthp_bpf_srcu);
> +
> +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
> +{
> +	struct cgroup *cgrp;
> +	struct mem_cgroup *memcg;
> +	struct bpf_mthp_ops *ops;
> +	int idx;
> +
> +	memcg = get_mem_cgroup_from_mm(mm);
> +	if (!memcg)
> +		return orders;
> +
> +	cgrp = memcg->css.cgroup;
> +
> +	idx = srcu_read_lock(&mthp_bpf_srcu);
> +	ops = READ_ONCE(cgrp->mthp_ops);
> +	if (unlikely(ops && ops->mthp_choose))
> +		orders = ops->mthp_choose(cgrp, orders);
> +	srcu_read_unlock(&mthp_bpf_srcu, idx);
> +
> +	mem_cgroup_put(memcg);
> +
> +	return orders;
> +}
> +
> +static int bpf_mthp_ops_btf_struct_access(struct bpf_verifier_log *log,
> +		const struct bpf_reg_state *reg, int off, int size)
> +{
> +	return -EACCES;
> +}
> +
> +static bool bpf_mthp_ops_is_valid_access(int off, int size, enum bpf_access_type type,
> +		const struct bpf_prog *prog, struct bpf_insn_access_aux *info)
> +{
> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +const struct bpf_verifier_ops bpf_mthp_verifier_ops = {
> +	.get_func_proto = bpf_base_func_proto,
> +	.btf_struct_access = bpf_mthp_ops_btf_struct_access,
> +	.is_valid_access = bpf_mthp_ops_is_valid_access,
> +};
> +
> +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> +	struct bpf_mthp_ops *ops = kdata;
> +	struct cgroup_subsys_state *child;
> +	struct cgroup *cgrp;
> +
> +	if (!link)
> +		return -EOPNOTSUPP;
> +
> +	cgrp = st_link->cgroup;
> +	if (!cgrp)
> +		return -EINVAL;
> +
> +	cgroup_lock();
> +	css_for_each_descendant_pre(child, &cgrp->self) {
> +		if (READ_ONCE(child->cgroup->mthp_ops)) {
> +			pr_warn("sub-cgroup has already registered.\n");
> +			cgroup_unlock();
> +			return -EBUSY;
> +		}
> +	}
> +	css_for_each_descendant_pre(child, &cgrp->self)
> +		WRITE_ONCE(child->cgroup->mthp_ops, ops);
> +	cgroup_unlock();
> +
> +	return 0;
> +}
> +
> +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> +	struct cgroup_subsys_state *child;
> +	struct cgroup *cgrp;
> +
> +	if (!link)
> +		return;
> +
> +	cgrp = st_link->cgroup;
> +	if (!cgrp)
> +		return;
> +
> +	cgroup_lock();
> +	css_for_each_descendant_pre(child, &cgrp->self)
> +		WRITE_ONCE(child->cgroup->mthp_ops, NULL);
> +	cgroup_unlock();
> +
> +	synchronize_srcu(&mthp_bpf_srcu);
> +}
> +
> +static int bpf_mthp_ops_check_member(const struct btf_type *t,
> +				     const struct btf_member *member,
> +				     const struct bpf_prog *prog)
> +{
> +	u32 moff = __btf_member_bit_offset(t, member) / 8;
> +
> +	switch (moff) {
> +	case offsetof(struct bpf_mthp_ops, mthp_choose):
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (prog->sleepable)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int bpf_mthp_ops_init_member(const struct btf_type *t,
> +				    const struct btf_member *member,
> +				    void *kdata, const void *udata)
> +{
> +	return 0;
> +}
> +
> +static int bpf_mthp_ops_init(struct btf *btf)
> +{
> +	return 0;
> +}
> +
> +static unsigned long cfi_mthp_choose(struct cgroup *cgrp, unsigned long orders)
> +{
> +	return 0;
> +}
> +
> +static struct bpf_mthp_ops cfi_bpf_mthp_ops = {
> +	.mthp_choose = cfi_mthp_choose,
> +};
> +
> +static struct bpf_struct_ops bso_bpf_mthp_ops = {
> +	.verifier_ops = &bpf_mthp_verifier_ops,
> +	.reg = bpf_mthp_ops_reg,
> +	.unreg = bpf_mthp_ops_unreg,
> +	.check_member = bpf_mthp_ops_check_member,
> +	.init_member = bpf_mthp_ops_init_member,
> +	.init = bpf_mthp_ops_init,
> +	.name = "bpf_mthp_ops",
> +	.owner = THIS_MODULE,
> +	.cfi_stubs = &cfi_bpf_mthp_ops,
> +};
> +
> +static int __init bpf_huge_memory_init(void)
> +{
> +	int err;
> +
> +	err = register_bpf_struct_ops(&bso_bpf_mthp_ops, bpf_mthp_ops);
> +	if (err)
> +		pr_warn("Registration of bpf_mthp_ops failed, err %d\n", err);
> +
> +	return err;
> +}
> +late_initcall(bpf_huge_memory_init);
> --
> 2.53.0
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
  2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
                   ` (4 preceding siblings ...)
  2026-05-08 15:14 ` [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Lorenzo Stoakes
@ 2026-05-08 16:00 ` Pedro Falcato
  2026-05-08 16:15   ` Lorenzo Stoakes
  5 siblings, 1 reply; 16+ messages in thread
From: Pedro Falcato @ 2026-05-08 16:00 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
	gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang

On Fri, May 08, 2026 at 11:00:51PM +0800, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
> 
> Hi all,
> 
> Background
> ==========
> 
> As is well known, a system can simultaneously run multiple different
> scenarios. However, THP is not beneficial in every scenario — it is only
> most suitable for memory-intensive applications that are not sensitive
> to tail latency. For example, Redis, which is sensitive to tail latency,
> is not suitable for THP. But in practice, due to Redis issues, the
> entire THP functionality is often turned off, preventing other scenarios
> from benefiting from it.
> 
> There are also some embedded scenarios (e.g. Android) that directly use
> 2MB THP, where the granularity is too large. Therefore, we introduced
> mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> still globally fix a single mTHP size and are unable to automatically
> select different mTHP sizes based on different scenarios.
> 
> After testing, it was found that
> 
> - When the system has a lot of free memory, it is normal for Redis to
>   use mTHP. performance degradation in Redis only occurs when the system
>   is under high memory pressure.
> - Additionally, when a large number of small-memory processes use mTHP,
>   memory waste is prone to occur, and performance degradation may also
>   happen during fast memory allocation/release.
> 
> Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> following issues.
> 
> - It breaks the cgroup hierarchy property.
> - Add new THP knobs, making sysadmin's job more complex
> 
> Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> following issues.
> 
> - It didn't address the issue on the per-process mode.
> - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
>   the same objective, there is no need to add two mechanisms for the
>   same purpose.
> - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
>   faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
>   cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
>   implementation.
> - Unclear ABI stability guarantees.
> - The test cases are too simplistic, lacking eBPF cases similar to real
>   workloads such as sched_ext.
> 
> If I miss some thing, please let me know. Thanks!
>
<snip> 
> kernbench results
> ~~~~~~~~~~~~~~~~~
> 
> When cgroup memory.high=max, no memory pressure, seems only noise level
> changes, mthp_ext no regression.
> 
>                             always                 never               always+mthp_ext
> Amean     user-32    19702.39 (   0.00%)    18428.90 *   6.46%*    19706.73 (  -0.02%)
> Amean     syst-32     1159.55 (   0.00%)     2252.43 * -94.25%*     1177.48 *  -1.55%*
> Amean     elsp-32      703.28 (   0.00%)      699.10 *   0.59%*      703.99 *  -0.10%*
> BAmean-95 user-32    19701.79 (   0.00%)    18425.01 (   6.48%)    19704.78 (  -0.02%)
> BAmean-95 syst-32     1159.43 (   0.00%)     2251.86 ( -94.22%)     1177.03 (  -1.52%)
> BAmean-95 elsp-32      703.24 (   0.00%)      698.99 (   0.61%)      703.88 (  -0.09%)
> BAmean-99 user-32    19701.79 (   0.00%)    18425.01 (   6.48%)    19704.78 (  -0.02%)
> BAmean-99 syst-32     1159.43 (   0.00%)     2251.86 ( -94.22%)     1177.03 (  -1.52%)
> BAmean-99 elsp-32      703.24 (   0.00%)      698.99 (   0.61%)      703.88 (  -0.09%)
> 
> When cgroup memory.high=2G, high memory pressure, mthp_ext improved by 26%.
> 
>                             always                 never               always+mthp_ext
> Amean     user-32    20250.65 (   0.00%)    18368.91 *   9.29%*    18681.27 *   7.75%*
> Amean     syst-32    12778.56 (   0.00%)     9636.99 *  24.58%*     9392.65 *  26.50%*
> Amean     elsp-32     1377.55 (   0.00%)     1026.10 *  25.51%*     1019.40 *  26.00%*
> BAmean-95 user-32    20233.75 (   0.00%)    18353.57 (   9.29%)    18678.01 (   7.69%)
> BAmean-95 syst-32    12543.21 (   0.00%)     9612.28 (  23.37%)     9386.83 (  25.16%)
> BAmean-95 elsp-32     1367.82 (   0.00%)     1023.75 (  25.15%)     1018.17 (  25.56%)
> BAmean-99 user-32    20233.75 (   0.00%)    18353.57 (   9.29%)    18678.01 (   7.69%)
> BAmean-99 syst-32    12543.21 (   0.00%)     9612.28 (  23.37%)     9386.83 (  25.16%)
> BAmean-99 elsp-32     1367.82 (   0.00%)     1023.75 (  25.15%)     1018.17 (  25.56%)
> 
> TODO
> ====
> 
> - mthp_ext handles different "enum tva_type" values. For example, for
>   small-memory processes, only 4KB is used in TVA_PAGEFAULT, while
>   TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continues to collapse all mthp
>   size. Under high memory pressure, only 4KB is used for
>   TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues to
>   collapse all mthp size.
> - selftest
> 
> If there are additional scenarios, please let me know as well, so I can
> conduct further prototype verification tests to make mTHP more
> transparent and further clear/stabilize the BPF-THP ABI.

How is it more transparent if you're essentially adding mTHP
micro-programmability from the user's side? This series makes it
_less_ transparent.

If you actually want to make it more transparent, then I would suggest
improving the heuristics such that (m)THP doesn't churn through memory
on high memory pressure. Or such that it doesn't feel extremely compelled
to place the largest THP it can based on vibes.

-- 
Pedro


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
  2026-05-08 15:14 ` [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Lorenzo Stoakes
@ 2026-05-08 16:05   ` Lorenzo Stoakes
  2026-05-08 16:53     ` Vernon Yang
  0 siblings, 1 reply; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 16:05 UTC (permalink / raw)
  To: Vernon Yang
  Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
	gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang

On Fri, May 08, 2026 at 04:15:04PM +0100, Lorenzo Stoakes wrote:
> Thanks for the series, but overall it's got to be no to this until THP and mTHP
> are in more stable shape.
>
> And this is an RFC, you're trying to make really fundamental changes here, it's
> almost... rude to do that out of the blue non-RFC'd (unless you're a maintainer
> perhaps).
>
> Right now the THP code base is a total mess and mTHP support is not even
> properly merged yet (khugepaged support outstanding).
>
> BPF interfaces are permanent, we've tried the 'experimental' thing before, it
> doesn't work and we'll not be able to yank it later.
>
> I've said it before, but we really truly need to get THP into better shape
> before we can tolerate large new changes, let alone an user-exported interface.
>
> So can we defer this until we're in better shape, and then send that as an RFC
> first please?

Yeah on second thoughts, NACK and don't send this series again please.

I was already annoyed you'd send something this invasive and massive without an
RFC, but you've also ignored the feedback we gave to the last THP BPF series
while ostensibly claiming to have taken it into account.

And then... I mean seriously... _shamelessly_ trying to take control away from
THP maintainers and reviewers who work bloody hard for this community by parking
code that changes mTHP behaviour in an entirely distinct and unrelated
MAINTAINERS section...!

There's a biweekly THP cabal meeting which you didn't raise this in, you didn't
bring this up at any conference, you didn't send an RFC.

You've sent it too before we even have mTHP khugepaged support merged... or have
really stabilised on how mTHP is supposed to work overall.

And also I have made it really abundantly clear that I want to see the technical
debt _paid down_ before we add anything else major.

And as if that wasn't enough, AI review is finding endless problems with this
series on top of all that.

This is NOT how to engage with upstream. Again, please don't send any more
revisions of this.

And next time _engage with the community_ before proposing something this big. A
[DISCUSSION] email, or an RFC, or in a meeting or at a conference, or even
off-list or on-list mail, something.

Lorenzo

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
  2026-05-08 16:00 ` Pedro Falcato
@ 2026-05-08 16:15   ` Lorenzo Stoakes
  0 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 16:15 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: Vernon Yang, akpm, david, roman.gushchin, inwardvessel,
	shakeel.butt, ast, daniel, surenb, tz2294, baohua, lance.yang,
	dev.jain, laoar.shao, gutierrez.asier, linux-kernel, linux-mm,
	bpf, Vernon Yang

On Fri, May 08, 2026 at 05:00:04PM +0100, Pedro Falcato wrote:
> On Fri, May 08, 2026 at 11:00:51PM +0800, Vernon Yang wrote:
> > From: Vernon Yang <yanglincheng@kylinos.cn>
> >
> > Hi all,
> >
> > Background
> > ==========
> >
> > As is well known, a system can simultaneously run multiple different
> > scenarios. However, THP is not beneficial in every scenario — it is only
> > most suitable for memory-intensive applications that are not sensitive
> > to tail latency. For example, Redis, which is sensitive to tail latency,
> > is not suitable for THP. But in practice, due to Redis issues, the
> > entire THP functionality is often turned off, preventing other scenarios
> > from benefiting from it.
> >
> > There are also some embedded scenarios (e.g. Android) that directly use
> > 2MB THP, where the granularity is too large. Therefore, we introduced
> > mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> > still globally fix a single mTHP size and are unable to automatically
> > select different mTHP sizes based on different scenarios.
> >
> > After testing, it was found that
> >
> > - When the system has a lot of free memory, it is normal for Redis to
> >   use mTHP. performance degradation in Redis only occurs when the system
> >   is under high memory pressure.
> > - Additionally, when a large number of small-memory processes use mTHP,
> >   memory waste is prone to occur, and performance degradation may also
> >   happen during fast memory allocation/release.
> >
> > Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> > following issues.
> >
> > - It breaks the cgroup hierarchy property.
> > - Add new THP knobs, making sysadmin's job more complex
> >
> > Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> > following issues.
> >
> > - It didn't address the issue on the per-process mode.
> > - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> >   the same objective, there is no need to add two mechanisms for the
> >   same purpose.
> > - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
> >   faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> >   cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> >   implementation.
> > - Unclear ABI stability guarantees.
> > - The test cases are too simplistic, lacking eBPF cases similar to real
> >   workloads such as sched_ext.
> >
> > If I miss some thing, please let me know. Thanks!
> >
> <snip>
> > kernbench results
> > ~~~~~~~~~~~~~~~~~
> >
> > When cgroup memory.high=max, no memory pressure, seems only noise level
> > changes, mthp_ext no regression.
> >
> >                             always                 never               always+mthp_ext
> > Amean     user-32    19702.39 (   0.00%)    18428.90 *   6.46%*    19706.73 (  -0.02%)
> > Amean     syst-32     1159.55 (   0.00%)     2252.43 * -94.25%*     1177.48 *  -1.55%*
> > Amean     elsp-32      703.28 (   0.00%)      699.10 *   0.59%*      703.99 *  -0.10%*
> > BAmean-95 user-32    19701.79 (   0.00%)    18425.01 (   6.48%)    19704.78 (  -0.02%)
> > BAmean-95 syst-32     1159.43 (   0.00%)     2251.86 ( -94.22%)     1177.03 (  -1.52%)
> > BAmean-95 elsp-32      703.24 (   0.00%)      698.99 (   0.61%)      703.88 (  -0.09%)
> > BAmean-99 user-32    19701.79 (   0.00%)    18425.01 (   6.48%)    19704.78 (  -0.02%)
> > BAmean-99 syst-32     1159.43 (   0.00%)     2251.86 ( -94.22%)     1177.03 (  -1.52%)
> > BAmean-99 elsp-32      703.24 (   0.00%)      698.99 (   0.61%)      703.88 (  -0.09%)
> >
> > When cgroup memory.high=2G, high memory pressure, mthp_ext improved by 26%.
> >
> >                             always                 never               always+mthp_ext
> > Amean     user-32    20250.65 (   0.00%)    18368.91 *   9.29%*    18681.27 *   7.75%*
> > Amean     syst-32    12778.56 (   0.00%)     9636.99 *  24.58%*     9392.65 *  26.50%*
> > Amean     elsp-32     1377.55 (   0.00%)     1026.10 *  25.51%*     1019.40 *  26.00%*
> > BAmean-95 user-32    20233.75 (   0.00%)    18353.57 (   9.29%)    18678.01 (   7.69%)
> > BAmean-95 syst-32    12543.21 (   0.00%)     9612.28 (  23.37%)     9386.83 (  25.16%)
> > BAmean-95 elsp-32     1367.82 (   0.00%)     1023.75 (  25.15%)     1018.17 (  25.56%)
> > BAmean-99 user-32    20233.75 (   0.00%)    18353.57 (   9.29%)    18678.01 (   7.69%)
> > BAmean-99 syst-32    12543.21 (   0.00%)     9612.28 (  23.37%)     9386.83 (  25.16%)
> > BAmean-99 elsp-32     1367.82 (   0.00%)     1023.75 (  25.15%)     1018.17 (  25.56%)
> >
> > TODO
> > ====
> >
> > - mthp_ext handles different "enum tva_type" values. For example, for
> >   small-memory processes, only 4KB is used in TVA_PAGEFAULT, while
> >   TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continues to collapse all mthp
> >   size. Under high memory pressure, only 4KB is used for
> >   TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues to
> >   collapse all mthp size.
> > - selftest
> >
> > If there are additional scenarios, please let me know as well, so I can
> > conduct further prototype verification tests to make mTHP more
> > transparent and further clear/stabilize the BPF-THP ABI.
>
> How is it more transparent if you're essentially adding mTHP
> micro-programmability from the user's side? This series makes it
> _less_ transparent.
>
> If you actually want to make it more transparent, then I would suggest
> improving the heuristics such that (m)THP doesn't churn through memory
> on high memory pressure. Or such that it doesn't feel extremely compelled
> to place the largest THP it can based on vibes.

I agree but I also don't really want to see anything like that until mTHP is
actually stabilised and the code base is less appalling :)

We've deferred paying down technical debt far too long.

>
> --
> Pedro

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
  2026-05-08 16:05   ` Lorenzo Stoakes
@ 2026-05-08 16:53     ` Vernon Yang
  0 siblings, 0 replies; 16+ messages in thread
From: Vernon Yang @ 2026-05-08 16:53 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
	daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
	gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang

On Sat, May 9, 2026 at 12:05 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 08, 2026 at 04:15:04PM +0100, Lorenzo Stoakes wrote:
> > Thanks for the series, but overall it's got to be no to this until THP and mTHP
> > are in more stable shape.
> >
> > And this is an RFC, you're trying to make really fundamental changes here, it's
> > almost... rude to do that out of the blue non-RFC'd (unless you're a maintainer
> > perhaps).
> >
> > Right now the THP code base is a total mess and mTHP support is not even
> > properly merged yet (khugepaged support outstanding).
> >
> > BPF interfaces are permanent, we've tried the 'experimental' thing before, it
> > doesn't work and we'll not be able to yank it later.
> >
> > I've said it before, but we really truly need to get THP into better shape
> > before we can tolerate large new changes, let alone an user-exported interface.
> >
> > So can we defer this until we're in better shape, and then send that as an RFC
> > first please?
>
> Yeah on second thoughts, NACK and don't send this series again please.
>
> I was already annoyed you'd send something this invasive and massive without an
> RFC, but you've also ignored the feedback we gave to the last THP BPF series
> while ostensibly claiming to have taken it into account.
>
> And then... I mean seriously... _shamelessly_ trying to take control away from
> THP maintainers and reviewers who work bloody hard for this community by parking
> code that changes mTHP behaviour in an entirely distinct and unrelated
> MAINTAINERS section...!
>
> There's a biweekly THP cabal meeting which you didn't raise this in, you didn't
> bring this up at any conference, you didn't send an RFC.
>
> You've sent it too before we even have mTHP khugepaged support merged... or have
> really stabilised on how mTHP is supposed to work overall.
>
> And also I have made it really abundantly clear that I want to see the technical
> debt _paid down_ before we add anything else major.
>
> And as if that wasn't enough, AI review is finding endless problems with this
> series on top of all that.
>
> This is NOT how to engage with upstream. Again, please don't send any more
> revisions of this.
>
> And next time _engage with the community_ before proposing something this big. A
> [DISCUSSION] email, or an RFC, or in a meeting or at a conference, or even
> off-list or on-list mail, something.

Firstly, before mTHP stabilizes and enters better shape, I will not
submit any new version.

Let me clarify a few issues:
1. This is an RFC. I forgot to add it. Sorry.
2. There is only one issue in the AI review; the rest are false
positives (the AI did not find the dependent patch "mm: BPF OOM").
3. Regarding placing bpf_huge_memory.c under "MEMORY MANAGEMENT
EXTENSIONS": I never intended to take control of THP away from
maintainers and reviewers. However, it is still my fault for causing
misunderstanding. Sorry.

Also, I would like to ask: what work on mTHP still needs further
refinement at present? I can help out.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
  2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
  2026-05-08 15:40   ` bot+bpf-ci
  2026-05-08 15:57   ` Lorenzo Stoakes
@ 2026-05-08 20:54   ` David Hildenbrand (Arm)
  2 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 20:54 UTC (permalink / raw)
  To: Vernon Yang, akpm, ljs, roman.gushchin, inwardvessel,
	shakeel.butt, ast, daniel, surenb
  Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
	linux-kernel, linux-mm, bpf, Vernon Yang

>  
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  #include <linux/kobject.h>
> @@ -296,6 +297,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  				       enum tva_type type,
>  				       unsigned long orders)
>  {
> +	/* The eBPF-specified orders overrides which order is selected. */
> +	orders &= bpf_mthp_choose(vma->vm_mm, orders);
> +	if (!orders)
> +		return 0;
> +

There was some discussion around this in the past: where should we hook into
(e.g., deferred shrinker?), which information should we provide to the hook
(e.g., vma properties?).

We concluded mostly to "we don't know". I know that Rik van Riel wanted to look
into doing this properly, but seems like he got distracted :)

I assume there will be a lwn.net article covering the "BPF in MM" session we had
at LSF/MM just this week.

Conclusion: ABI stability is a headake.

The simplistic approach of deciding an order for the whole MM is very likely not
what we want.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-05-08 20:55 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
2026-05-08 15:19   ` Lorenzo Stoakes
2026-05-08 15:00 ` [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
2026-05-08 15:40   ` bot+bpf-ci
2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
2026-05-08 15:40   ` bot+bpf-ci
2026-05-08 15:57   ` Lorenzo Stoakes
2026-05-08 20:54   ` David Hildenbrand (Arm)
2026-05-08 15:00 ` [PATCH v2 4/4] samples: bpf: add mthp_ext Vernon Yang
2026-05-08 15:40   ` bot+bpf-ci
2026-05-08 15:14 ` [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Lorenzo Stoakes
2026-05-08 16:05   ` Lorenzo Stoakes
2026-05-08 16:53     ` Vernon Yang
2026-05-08 16:00 ` Pedro Falcato
2026-05-08 16:15   ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox