* [PATCH v2 1/4] psi: add psi_group_flush_stats() function
2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
@ 2026-05-08 15:00 ` Vernon Yang
2026-05-08 15:19 ` Lorenzo Stoakes
2026-05-08 21:36 ` sashiko-bot
2026-05-08 15:00 ` [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
` (4 subsequent siblings)
5 siblings, 2 replies; 22+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb
Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
linux-kernel, linux-mm, bpf, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Add psi_group_flush_stats() function to prepare for the subsequent
mthp_ext eBPF program.
No functional changes.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
include/linux/psi.h | 1 +
kernel/sched/psi.c | 34 ++++++++++++++++++++++++++--------
2 files changed, 27 insertions(+), 8 deletions(-)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index e0745873e3f2..7b4fd8190810 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -22,6 +22,7 @@ void psi_init(void);
void psi_memstall_enter(unsigned long *flags);
void psi_memstall_leave(unsigned long *flags);
+void psi_group_flush_stats(struct psi_group *group);
int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
enum psi_res res, struct file *file,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d9c9d9480a45..76ffad90b0b5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1242,11 +1242,35 @@ void psi_cgroup_restart(struct psi_group *group)
}
#endif /* CONFIG_CGROUPS */
+/*
+ * __psi_group_flush_stats - flush the total stall time of a psi group
+ * @group: psi group to flush
+ */
+static void __psi_group_flush_stats(struct psi_group *group)
+{
+ u64 now;
+
+ /* Update averages before reporting them */
+ mutex_lock(&group->avgs_lock);
+ now = sched_clock();
+ collect_percpu_times(group, PSI_AVGS, NULL);
+ if (now >= group->avg_next_update)
+ group->avg_next_update = update_averages(group, now);
+ mutex_unlock(&group->avgs_lock);
+}
+
+void psi_group_flush_stats(struct psi_group *group)
+{
+ if (static_branch_likely(&psi_disabled))
+ return;
+
+ __psi_group_flush_stats(group);
+}
+
int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
{
bool only_full = false;
int full;
- u64 now;
if (static_branch_likely(&psi_disabled))
return -EOPNOTSUPP;
@@ -1256,13 +1280,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
return -EOPNOTSUPP;
#endif
- /* Update averages before reporting them */
- mutex_lock(&group->avgs_lock);
- now = sched_clock();
- collect_percpu_times(group, PSI_AVGS, NULL);
- if (now >= group->avg_next_update)
- group->avg_next_update = update_averages(group, now);
- mutex_unlock(&group->avgs_lock);
+ __psi_group_flush_stats(group);
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
only_full = res == PSI_IRQ;
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/4] psi: add psi_group_flush_stats() function
2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
@ 2026-05-08 15:19 ` Lorenzo Stoakes
2026-05-08 21:36 ` sashiko-bot
1 sibling, 0 replies; 22+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 15:19 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang
On Fri, May 08, 2026 at 11:00:52PM +0800, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Add psi_group_flush_stats() function to prepare for the subsequent
> mthp_ext ebpf program.
This isn't a great commit message: you're just saying that you're adding
a function and what you plan to use it for, not anything about why it's
needed.
>
> no function changes.
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
> ---
> include/linux/psi.h | 1 +
> kernel/sched/psi.c | 34 ++++++++++++++++++++++++++--------
> 2 files changed, 27 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index e0745873e3f2..7b4fd8190810 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -22,6 +22,7 @@ void psi_init(void);
> void psi_memstall_enter(unsigned long *flags);
> void psi_memstall_leave(unsigned long *flags);
>
> +void psi_group_flush_stats(struct psi_group *group);
Feels a bit iffy, exporting an internal management function?
> int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
> struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
> enum psi_res res, struct file *file,
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index d9c9d9480a45..76ffad90b0b5 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -1242,11 +1242,35 @@ void psi_cgroup_restart(struct psi_group *group)
> }
> #endif /* CONFIG_CGROUPS */
>
> +/*
> + * __psi_group_flush_stats - flush the total stall time of a psi group
> + * @group: psi group to flush
> + */
> +static void __psi_group_flush_stats(struct psi_group *group)
> +{
> + u64 now;
> +
> + /* Update averages before reporting them */
> + mutex_lock(&group->avgs_lock);
> + now = sched_clock();
> + collect_percpu_times(group, PSI_AVGS, NULL);
> + if (now >= group->avg_next_update)
> + group->avg_next_update = update_averages(group, now);
> + mutex_unlock(&group->avgs_lock);
If we do need to factor this out, maybe worth making the mutex lock/unlock a
guard(mutex)(&group->avgs_lock) instead?
> +}
> +
> +void psi_group_flush_stats(struct psi_group *group)
> +{
> + if (static_branch_likely(&psi_disabled))
> + return;
Is it actually likely if you're calling this function?
And the caller doesn't care even if PSI is disabled?
> +
> + __psi_group_flush_stats(group);
> +}
> +
> int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
> {
> bool only_full = false;
> int full;
> - u64 now;
>
> if (static_branch_likely(&psi_disabled))
> return -EOPNOTSUPP;
> @@ -1256,13 +1280,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
> return -EOPNOTSUPP;
> #endif
>
> - /* Update averages before reporting them */
> - mutex_lock(&group->avgs_lock);
> - now = sched_clock();
> - collect_percpu_times(group, PSI_AVGS, NULL);
> - if (now >= group->avg_next_update)
> - group->avg_next_update = update_averages(group, now);
> - mutex_unlock(&group->avgs_lock);
> + __psi_group_flush_stats(group);
>
> #ifdef CONFIG_IRQ_TIME_ACCOUNTING
> only_full = res == PSI_IRQ;
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 22+ messages in thread

* Re: [PATCH v2 1/4] psi: add psi_group_flush_stats() function
2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
2026-05-08 15:19 ` Lorenzo Stoakes
@ 2026-05-08 21:36 ` sashiko-bot
1 sibling, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-05-08 21:36 UTC (permalink / raw)
To: Vernon Yang; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Low] Build failure when CONFIG_CGROUPS is enabled but CONFIG_PSI is disabled.
--
commit 3c39f362bc58a25b857026542cbe7d4bb70721fd
Author: Vernon Yang <yanglincheng@kylinos.cn>
psi: add psi_group_flush_stats() function
This commit adds the psi_group_flush_stats() function to prepare for
the subsequent mthp_ext ebpf program.
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index e0745873e3f26..7b4fd8190810f 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -22,6 +22,7 @@ void psi_init(void);
> void psi_memstall_enter(unsigned long *flags);
> void psi_memstall_leave(unsigned long *flags);
>
> +void psi_group_flush_stats(struct psi_group *group);
Will this cause a build failure if the kernel is configured with
CONFIG_CGROUPS=y but CONFIG_PSI=n?
Because there is no fallback stub or empty definition for
psi_group_flush_stats() when CONFIG_PSI is disabled, callers added in
subsequent patches under CONFIG_CGROUPS might face undefined references.
Would it make sense to add an empty inline stub in the #else branch
for CONFIG_PSI?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508150055.680136-1-vernon2gm@gmail.com?part=1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function
2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
@ 2026-05-08 15:00 ` Vernon Yang
2026-05-08 15:40 ` bot+bpf-ci
2026-05-08 22:01 ` sashiko-bot
2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
` (3 subsequent siblings)
5 siblings, 2 replies; 22+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb
Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
linux-kernel, linux-mm, bpf, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Add bpf_cgroup_{flush_stats,stall} functions to prepare for the
subsequent mthp_ext eBPF program.
No functional changes.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
include/linux/psi.h | 4 ++++
kernel/bpf/helpers.c | 34 ++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b4fd8190810..243dcf97bea4 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -52,6 +52,10 @@ static inline void psi_memstall_enter(unsigned long *flags) {}
static inline void psi_memstall_leave(unsigned long *flags) {}
#ifdef CONFIG_CGROUPS
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+ return NULL;
+}
static inline int psi_cgroup_alloc(struct cgroup *cgrp)
{
return 0;
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 2bb60200c266..1c353e0ff14f 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -29,6 +29,7 @@
#include <linux/task_work.h>
#include <linux/irq_work.h>
#include <linux/buildid.h>
+#include <linux/psi.h>
#include "../../lib/kstrtox.h"
@@ -2881,6 +2882,37 @@ bpf_task_get_cgroup1(struct task_struct *task, int hierarchy_id)
return NULL;
return cgrp;
}
+
+/**
+ * bpf_cgroup_stall - acquire the total stall time of cgroup
+ * @cgrp: cgroup struct
+ * @states: psi states
+ *
+ * Return the total stall time.
+ */
+__bpf_kfunc u64 bpf_cgroup_stall(struct cgroup *cgrp, enum psi_states states)
+{
+ struct psi_group *group = cgroup_psi(cgrp);
+
+ if (unlikely(!group || (u32)states >= NR_PSI_STATES - 1))
+ return (u64)-1;
+
+ return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC);
+}
+
+/**
+ * bpf_cgroup_flush_stats - Flush cgroup's statistics
+ * @cgrp: cgroup struct
+ */
+__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp)
+{
+ struct psi_group *group = cgroup_psi(cgrp);
+
+ if (unlikely(!group))
+ return;
+
+ psi_group_flush_stats(group);
+}
#endif /* CONFIG_CGROUPS */
/**
@@ -4734,6 +4766,8 @@ BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_cgroup_stall)
+BTF_ID_FLAGS(func, bpf_cgroup_flush_stats, KF_SLEEPABLE)
#endif
BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_task_from_vpid, KF_ACQUIRE | KF_RET_NULL)
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function
2026-05-08 15:00 ` [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
@ 2026-05-08 15:40 ` bot+bpf-ci
2026-05-08 22:01 ` sashiko-bot
1 sibling, 0 replies; 22+ messages in thread
From: bot+bpf-ci @ 2026-05-08 15:40 UTC (permalink / raw)
To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel,
shakeel.butt, ast, daniel, surenb
Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
linux-kernel, linux-mm, bpf, yanglincheng, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
> diff --git a/include/linux/psi.h b/include/linux/psi.h
> index 7b4fd8190810..243dcf97bea4 100644
> --- a/include/linux/psi.h
> +++ b/include/linux/psi.h
> @@ -52,6 +52,10 @@ static inline void psi_memstall_enter(unsigned long *flags) {}
> static inline void psi_memstall_leave(unsigned long *flags) {}
>
> #ifdef CONFIG_CGROUPS
> +static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
> +{
> + return NULL;
> +}
> static inline int psi_cgroup_alloc(struct cgroup *cgrp)
> {
> return 0;
[ ... ]
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index baa12b24bb64..5562a6c9d2f9 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
> @@ -29,6 +29,7 @@
> #include <linux/task_work.h>
> #include <linux/irq_work.h>
> #include <linux/buildid.h>
> +#include <linux/psi.h>
>
> #include "../../lib/kstrtox.h"
>
[ ... ]
> @@ -2881,6 +2882,37 @@ bpf_task_get_cgroup1(struct task_struct *task, int hierarchy_id)
> return NULL;
> return cgrp;
> }
> +
> +/**
> + * bpf_cgroup_stall - acquire the total stall time of cgroup
> + * @cgrp: cgroup struct
> + * @states: psi states
> + *
> + * Return the total stall time.
> + */
> +__bpf_kfunc u64 bpf_cgroup_stall(struct cgroup *cgrp, enum psi_states states)
> +{
> + struct psi_group *group = cgroup_psi(cgrp);
> +
> + if (unlikely(!group || (u32)states >= NR_PSI_STATES - 1))
> + return (u64)-1;
> +
> + return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC);
> +}
> +
> +/**
> + * bpf_cgroup_flush_stats - Flush cgroup's statistics
> + * @cgrp: cgroup struct
> + */
> +__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp)
> +{
> + struct psi_group *group = cgroup_psi(cgrp);
> +
> + if (unlikely(!group))
> + return;
> +
> + psi_group_flush_stats(group);
> +}
> #endif /* CONFIG_CGROUPS */
Will these functions build when CONFIG_BPF_SYSCALL=y, CONFIG_CGROUPS=y,
but CONFIG_PSI=n?
The functions reference several PSI-only symbols:
In bpf_cgroup_stall():
- enum psi_states (parameter type) only exists in include/linux/psi_types.h
when CONFIG_PSI=y
- NR_PSI_STATES and PSI_AVGS are enum members only defined when CONFIG_PSI=y
- struct psi_group's total field doesn't exist when CONFIG_PSI=n (psi_types.h
defines an empty struct in that case)
In bpf_cgroup_flush_stats():
- psi_group_flush_stats() is only declared in include/linux/psi.h when
CONFIG_PSI=y with no stub for the !CONFIG_PSI case
kernel/bpf/helpers.o is compiled whenever CONFIG_BPF_SYSCALL=y. Since PSI
is an independent config option, the combination CONFIG_BPF_SYSCALL=y &&
CONFIG_CGROUPS=y && CONFIG_PSI=n is valid and would hit undefined symbols.
The cgroup_psi() stub added in include/linux/psi.h only handles the
!CONFIG_PSI case for cgroup_psi() itself, but doesn't solve the unresolved
PSI types and functions. Should these kfuncs be guarded with CONFIG_PSI,
similar to how kernel/cgroup/cgroup.c handles cgroup_pressure_show()?
>
> /**
> @@ -4734,6 +4766,8 @@ BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
> BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
> BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_cgroup_stall)
> +BTF_ID_FLAGS(func, bpf_cgroup_flush_stats, KF_SLEEPABLE)
> #endif
These BTF_ID_FLAGS entries would also need CONFIG_PSI guards if the function
definitions are moved inside CONFIG_PSI.
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25563319416
^ permalink raw reply [flat|nested] 22+ messages in thread

* Re: [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function
2026-05-08 15:00 ` [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
2026-05-08 15:40 ` bot+bpf-ci
@ 2026-05-08 22:01 ` sashiko-bot
1 sibling, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-05-08 22:01 UTC (permalink / raw)
To: Vernon Yang; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] The `bpf_cgroup_stall` kfunc reads the 64-bit statistic `group->total` locklessly, exposing a torn read vulnerability on 32-bit architectures.
- [Low] The patch introduces a build failure when `CONFIG_CGROUPS` is enabled but `CONFIG_PSI` is disabled.
--
commit 6a28dc9991d77ff5f1bc78ed809531c8806dd33a
Author: Vernon Yang <yanglincheng@kylinos.cn>
bpf: add bpf_cgroup_{flush_stats,stall} function
Add bpf_cgroup_{flush_stats,stall} function to prepare for the
subsequent mthp_ext ebpf program.
> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
> index 2bb60200c2665..1c353e0ff14f4 100644
> --- a/kernel/bpf/helpers.c
> +++ b/kernel/bpf/helpers.c
[ ... ]
> @@ -2881,6 +2882,37 @@ bpf_task_get_cgroup1(struct task_struct *task, int hierarchy_id)
> return NULL;
> return cgrp;
> }
> +
> +/**
> + * bpf_cgroup_stall - acquire the total stall time of cgroup
> + * @cgrp: cgroup struct
> + * @states: psi states
> + *
> + * Return the total stall time.
> + */
> +__bpf_kfunc u64 bpf_cgroup_stall(struct cgroup *cgrp, enum psi_states states)
> +{
> + struct psi_group *group = cgroup_psi(cgrp);
> +
> + if (unlikely(!group || (u32)states >= NR_PSI_STATES - 1))
> + return (u64)-1;
> +
> + return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC);
Will this cause a torn read on 32-bit architectures?
The 64-bit cumulative counter group->total[PSI_AVGS][states] is updated using
standard non-atomic addition in collect_percpu_times(), which runs in process
context under group->avgs_lock.
Since bpf_cgroup_stall() reads this value directly without holding the lock,
and non-sleepable BPF programs can execute in NMI or hardirq contexts, they
can preempt the update process.
On 32-bit architectures, a 64-bit update and read are split into two 32-bit
operations. If the BPF program preempts the CPU between the lower and upper
half updates, it could read a torn 64-bit value, leading to an incorrect,
artificial spike or drop in the reported stall time.
> +}
> +
> +/**
> + * bpf_cgroup_flush_stats - Flush cgroup's statistics
> + * @cgrp: cgroup struct
> + */
> +__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp)
> +{
> + struct psi_group *group = cgroup_psi(cgrp);
> +
> + if (unlikely(!group))
> + return;
> +
> + psi_group_flush_stats(group);
Can this lead to a build failure when CONFIG_CGROUPS is enabled but CONFIG_PSI
is disabled?
Both bpf_cgroup_stall() and bpf_cgroup_flush_stats() are guarded by
#ifdef CONFIG_CGROUPS, but heavily depend on PSI-specific constructs like
enum psi_states, NR_PSI_STATES, PSI_AVGS, group->total, and
psi_group_flush_stats().
When CONFIG_PSI is disabled, include/linux/psi_types.h defines
struct psi_group as an empty struct and omits definitions for the enums and
macros, and psi_group_flush_stats() is undeclared.
> +}
> #endif /* CONFIG_CGROUPS */
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508150055.680136-1-vernon2gm@gmail.com?part=2
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
2026-05-08 15:00 ` [PATCH v2 1/4] psi: add psi_group_flush_stats() function Vernon Yang
2026-05-08 15:00 ` [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
@ 2026-05-08 15:00 ` Vernon Yang
2026-05-08 15:40 ` bot+bpf-ci
` (3 more replies)
2026-05-08 15:00 ` [PATCH v2 4/4] samples: bpf: add mthp_ext Vernon Yang
` (2 subsequent siblings)
5 siblings, 4 replies; 22+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb
Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
linux-kernel, linux-mm, bpf, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Introducing bpf_mthp_ops enables eBPF programs to register the
mthp_choose callback via cgroup-bpf.
Using cgroup-bpf to customize mTHP sizes for different scenarios, we can
automatically select different mTHP sizes for different cgroups and
focus on making them truly transparent.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
MAINTAINERS | 3 +
include/linux/bpf_huge_memory.h | 52 ++++++++++
include/linux/cgroup-defs.h | 1 +
include/linux/huge_mm.h | 6 ++
kernel/cgroup/cgroup.c | 2 +
mm/Kconfig | 14 +++
mm/Makefile | 1 +
mm/bpf_huge_memory.c | 168 ++++++++++++++++++++++++++++++++
8 files changed, 247 insertions(+)
create mode 100644 include/linux/bpf_huge_memory.h
create mode 100644 mm/bpf_huge_memory.c
diff --git a/MAINTAINERS b/MAINTAINERS
index caaa0d6e6056..f1113eaa1193 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4887,7 +4887,10 @@ M: Shakeel Butt <shakeel.butt@linux.dev>
L: bpf@vger.kernel.org
L: linux-mm@kvack.org
S: Maintained
+F: include/linux/bpf_huge_memory.h
+F: mm/bpf_huge_memory.c
F: mm/bpf_memcontrol.c
+F: samples/bpf/mthp_ext.*
BPF [MISC]
L: bpf@vger.kernel.org
diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
new file mode 100644
index 000000000000..ffda445c9572
--- /dev/null
+++ b/include/linux/bpf_huge_memory.h
@@ -0,0 +1,52 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_HUGE_MEMORY_H
+#define __BPF_HUGE_MEMORY_H
+
+#include <linux/cgroup-defs.h>
+
+/**
+ * struct bpf_mthp_ops - BPF callbacks for mTHP operations
+ * @mthp_choose: Choose the custom mTHP orders
+ *
+ * This structure defines the interface for BPF programs to customize
+ * mTHP behavior through struct_ops programs.
+ */
+struct bpf_mthp_ops {
+ unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders);
+};
+
+#ifdef CONFIG_BPF_TRANSPARENT_HUGEPAGE
+/**
+ * bpf_mthp_choose - Choose the custom mTHP orders using bpf
+ * @mm: task mm_struct
+ * @orders: original orders
+ *
+ * Return suited mTHP orders.
+ */
+unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders);
+
+/**
+ * cgroup_bpf_set_mthp_ops - Set sub-cgroup mthp_ops to parent cgroup
+ * @cgrp: want to set mthp_ops of sub-cgroup
+ * @parent: parent cgroup
+ */
+static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp,
+ struct cgroup *parent)
+{
+ WRITE_ONCE(cgrp->mthp_ops, parent->mthp_ops);
+}
+#else
+static inline unsigned long bpf_mthp_choose(struct mm_struct *mm,
+ unsigned long orders)
+{
+ return orders;
+}
+static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp,
+ struct cgroup *parent)
+{
+}
+#endif /* CONFIG_BPF_TRANSPARENT_HUGEPAGE */
+
+#endif /* __BPF_HUGE_MEMORY_H */
+
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f42563739d2e..78854d0e06ab 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -628,6 +628,7 @@ struct cgroup {
#ifdef CONFIG_BPF_SYSCALL
struct bpf_local_storage __rcu *bpf_cgrp_storage;
+ struct bpf_mthp_ops *mthp_ops;
#endif
#ifdef CONFIG_EXT_SUB_SCHED
struct scx_sched __rcu *scx_sched;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 127f9e1e7604..65da35fb0980 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
#define _LINUX_HUGE_MM_H
#include <linux/mm_types.h>
+#include <linux/bpf_huge_memory.h>
#include <linux/fs.h> /* only for vma_is_dax() */
#include <linux/kobject.h>
@@ -296,6 +297,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
enum tva_type type,
unsigned long orders)
{
+ /* The eBPF-specified orders overrides which order is selected. */
+ orders &= bpf_mthp_choose(vma->vm_mm, orders);
+ if (!orders)
+ return 0;
+
/*
* Optimization to check if required orders are enabled early. Only
* forced collapse ignores sysfs configs.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 43adc96c7f1a..1dbef3e8b179 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5836,6 +5836,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
if (ret)
goto out_stat_exit;
+ cgroup_bpf_set_mthp_ops(cgrp, parent);
+
for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
cgrp->ancestors[tcgrp->level] = tcgrp;
diff --git a/mm/Kconfig b/mm/Kconfig
index 27dc5b0139ba..be49bde783a7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -949,6 +949,20 @@ config NO_PAGE_MAPCOUNT
EXPERIMENTAL because the impact of some changes is still unclear.
+config BPF_TRANSPARENT_HUGEPAGE
+ bool "BPF-based transparent hugepage (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGEPAGE && CGROUP_BPF
+ help
+ Using cgroup-bpf to customize mTHP size for different scenarios,
+ automatically select different mTHP sizes for different cgroups,
+ let's focus on making them truly transparent.
+
+ This is an experimental feature, that might go away at any time,
+ Please do not rely any production environment.
+
+ EXPERIMENTAL because the BPF interface is unstable and may be removed
+ at any time.
+
endif # TRANSPARENT_HUGEPAGE
# simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..b474c21c3253 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
ifdef CONFIG_BPF_SYSCALL
obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
+obj-$(CONFIG_BPF_TRANSPARENT_HUGEPAGE) += bpf_huge_memory.o
endif
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
new file mode 100644
index 000000000000..851c6ebe2933
--- /dev/null
+++ b/mm/bpf_huge_memory.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Huge memory related BPF code
+ *
+ * Author: Vernon Yang <yanglincheng@kylinos.cn>
+ */
+
+#include <linux/bpf.h>
+#include <linux/srcu.h>
+
+/* Protects cgrp->mthp_ops pointer for read and write. */
+DEFINE_SRCU(mthp_bpf_srcu);
+
+unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
+{
+ struct cgroup *cgrp;
+ struct mem_cgroup *memcg;
+ struct bpf_mthp_ops *ops;
+ int idx;
+
+ memcg = get_mem_cgroup_from_mm(mm);
+ if (!memcg)
+ return orders;
+
+ cgrp = memcg->css.cgroup;
+
+ idx = srcu_read_lock(&mthp_bpf_srcu);
+ ops = READ_ONCE(cgrp->mthp_ops);
+ if (unlikely(ops && ops->mthp_choose))
+ orders = ops->mthp_choose(cgrp, orders);
+ srcu_read_unlock(&mthp_bpf_srcu, idx);
+
+ mem_cgroup_put(memcg);
+
+ return orders;
+}
+
+static int bpf_mthp_ops_btf_struct_access(struct bpf_verifier_log *log,
+ const struct bpf_reg_state *reg, int off, int size)
+{
+ return -EACCES;
+}
+
+static bool bpf_mthp_ops_is_valid_access(int off, int size, enum bpf_access_type type,
+ const struct bpf_prog *prog, struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+const struct bpf_verifier_ops bpf_mthp_verifier_ops = {
+ .get_func_proto = bpf_base_func_proto,
+ .btf_struct_access = bpf_mthp_ops_btf_struct_access,
+ .is_valid_access = bpf_mthp_ops_is_valid_access,
+};
+
+static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+ struct bpf_mthp_ops *ops = kdata;
+ struct cgroup_subsys_state *child;
+ struct cgroup *cgrp;
+
+ if (!link)
+ return -EOPNOTSUPP;
+
+ cgrp = st_link->cgroup;
+ if (!cgrp)
+ return -EINVAL;
+
+ cgroup_lock();
+ css_for_each_descendant_pre(child, &cgrp->self) {
+ if (READ_ONCE(child->cgroup->mthp_ops)) {
+ pr_warn("sub-cgroup has already registered.\n");
+ cgroup_unlock();
+ return -EBUSY;
+ }
+ }
+ css_for_each_descendant_pre(child, &cgrp->self)
+ WRITE_ONCE(child->cgroup->mthp_ops, ops);
+ cgroup_unlock();
+
+ return 0;
+}
+
+static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+ struct cgroup_subsys_state *child;
+ struct cgroup *cgrp;
+
+ if (!link)
+ return;
+
+ cgrp = st_link->cgroup;
+ if (!cgrp)
+ return;
+
+ cgroup_lock();
+ css_for_each_descendant_pre(child, &cgrp->self)
+ WRITE_ONCE(child->cgroup->mthp_ops, NULL);
+ cgroup_unlock();
+
+ synchronize_srcu(&mthp_bpf_srcu);
+}
+
+static int bpf_mthp_ops_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct bpf_mthp_ops, mthp_choose):
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (prog->sleepable)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int bpf_mthp_ops_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ return 0;
+}
+
+static int bpf_mthp_ops_init(struct btf *btf)
+{
+ return 0;
+}
+
+static unsigned long cfi_mthp_choose(struct cgroup *cgrp, unsigned long orders)
+{
+ return 0;
+}
+
+static struct bpf_mthp_ops cfi_bpf_mthp_ops = {
+ .mthp_choose = cfi_mthp_choose,
+};
+
+static struct bpf_struct_ops bso_bpf_mthp_ops = {
+ .verifier_ops = &bpf_mthp_verifier_ops,
+ .reg = bpf_mthp_ops_reg,
+ .unreg = bpf_mthp_ops_unreg,
+ .check_member = bpf_mthp_ops_check_member,
+ .init_member = bpf_mthp_ops_init_member,
+ .init = bpf_mthp_ops_init,
+ .name = "bpf_mthp_ops",
+ .owner = THIS_MODULE,
+ .cfi_stubs = &cfi_bpf_mthp_ops,
+};
+
+static int __init bpf_huge_memory_init(void)
+{
+ int err;
+
+ err = register_bpf_struct_ops(&bso_bpf_mthp_ops, bpf_mthp_ops);
+ if (err)
+ pr_warn("Registration of bpf_mthp_ops failed, err %d\n", err);
+
+ return err;
+}
+late_initcall(bpf_huge_memory_init);
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
@ 2026-05-08 15:40 ` bot+bpf-ci
2026-05-08 15:57 ` Lorenzo Stoakes
` (2 subsequent siblings)
3 siblings, 0 replies; 22+ messages in thread
From: bot+bpf-ci @ 2026-05-08 15:40 UTC (permalink / raw)
To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel,
shakeel.butt, ast, daniel, surenb
Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
linux-kernel, linux-mm, bpf, yanglincheng, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 13f49378b157..cad3a87d9873 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
[ ... ]
> diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
> new file mode 100644
> index 000000000000..ffda445c9572
> --- /dev/null
> +++ b/include/linux/bpf_huge_memory.h
[ ... ]
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f42563739d2e..78854d0e06ab 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
[ ... ]
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2949e5acff35..80ec622213df 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
[ ... ]
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1f084ee71443..7db83ca81e6e 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
[ ... ]
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0a43bb80df4f..f3309376bcc5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -962,6 +962,20 @@ config NO_PAGE_MAPCOUNT
>
> EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config BPF_TRANSPARENT_HUGEPAGE
> + bool "BPF-based transparent hugepage (EXPERIMENTAL)"
> + depends on TRANSPARENT_HUGEPAGE && CGROUP_BPF
> + help
> + Using cgroup-bpf to customize mTHP size for different scenarios,
> + automatically select different mTHP sizes for different cgroups,
> + let's focus on making them truly transparent.
> +
> + This is an experimental feature, that might go away at any time,
> + Please do not rely any production environment.
^^^
This isn't a bug, but there's a grammar issue in the Kconfig help text.
The sentence is missing a preposition and pronoun. It should probably be
"Please do not rely on this in any production environment."
> +
> + EXPERIMENTAL because the BPF interface is unstable and may be removed
> + at any time.
> +
[ ... ]
> diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
> new file mode 100644
> index 000000000000..851c6ebe2933
> --- /dev/null
> +++ b/mm/bpf_huge_memory.c
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Huge memory related BPF code
> + *
> + * Author: Vernon Yang <yanglincheng@kylinos.cn>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/srcu.h>
[ ... ]
> +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> + struct bpf_mthp_ops *ops = kdata;
> + struct cgroup_subsys_state *child;
> + struct cgroup *cgrp;
> +
> + if (!link)
> + return -EOPNOTSUPP;
> +
> + cgrp = st_link->cgroup;
^^^^^^^^^^^^^^^^^^^^
Does struct bpf_struct_ops_link have a cgroup field? Looking at
kernel/bpf/bpf_struct_ops.c, the structure is defined as:
struct bpf_struct_ops_link {
struct bpf_link link;
struct bpf_map __rcu *map;
wait_queue_head_t wait_hup;
};
There is no cgroup member. The same issue occurs in bpf_mthp_ops_unreg()
where st_link->cgroup is accessed. This would cause a compilation error
when CONFIG_BPF_TRANSPARENT_HUGEPAGE is enabled.
Were you intending to extend struct bpf_struct_ops_link to support
cgroup-scoped struct_ops, or is there a different mechanism planned for
associating the link with a specific cgroup?
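For illustration, making this compile would presumably need something along these lines in kernel/bpf/bpf_struct_ops.c (a hypothetical sketch of the missing piece, not code from this series):

```c
/* Hypothetical extension, not posted anywhere in this series. The struct
 * would also need to move to a shared header (or grow an accessor) so
 * mm/bpf_huge_memory.c could legally reach the new member. */
struct bpf_struct_ops_link {
	struct bpf_link link;
	struct bpf_map __rcu *map;
	wait_queue_head_t wait_hup;
	struct cgroup *cgroup;	/* set when the link is attached to a cgroup */
};
```

As posted, both bpf_mthp_ops_reg() and bpf_mthp_ops_unreg() dereference a member that does not exist.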
> + if (!cgrp)
> + return -EINVAL;
> +
> + cgroup_lock();
> + css_for_each_descendant_pre(child, &cgrp->self) {
> + if (READ_ONCE(child->cgroup->mthp_ops)) {
> + pr_warn("sub-cgroup has already registered.\n");
> + cgroup_unlock();
> + return -EBUSY;
> + }
> + }
> + css_for_each_descendant_pre(child, &cgrp->self)
> + WRITE_ONCE(child->cgroup->mthp_ops, ops);
> + cgroup_unlock();
> +
> + return 0;
> +}
> +
> +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> + struct cgroup_subsys_state *child;
> + struct cgroup *cgrp;
> +
> + if (!link)
> + return;
> +
> + cgrp = st_link->cgroup;
^^^^^^^^^^^^^^^^^^^^
Same issue here - the cgroup field doesn't exist in the struct.
> + if (!cgrp)
> + return;
> +
> + cgroup_lock();
> + css_for_each_descendant_pre(child, &cgrp->self)
> + WRITE_ONCE(child->cgroup->mthp_ops, NULL);
> + cgroup_unlock();
> +
> + synchronize_srcu(&mthp_bpf_srcu);
> +}
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25563319416
^ permalink raw reply [flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
2026-05-08 15:40 ` bot+bpf-ci
@ 2026-05-08 15:57 ` Lorenzo Stoakes
2026-05-08 20:54 ` David Hildenbrand (Arm)
2026-05-08 22:29 ` sashiko-bot
3 siblings, 0 replies; 22+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 15:57 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang
NACK
This patch not only overreaches by fundamentally impacting THP behaviour (which
has NOTHING to do with the subject line), but also, unbelievably, takes control
over this away from the THP maintainers. Are you actually serious here?
On Fri, May 08, 2026 at 11:00:54PM +0800, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Introducing bpf_mthp_ops enables eBPF programs to register the
> mthp_choose callback function via cgroup-ebpf.
>
> Using cgroup-bpf to customize mTHP size for different scenarios,
> automatically select different mTHP sizes for different cgroups,
> let's focus on making them truly transparent.
Err, wait what? You're both adding a BPF hook and then adding a default policy
change that affects all cgroups anyway?
Or are you not and this message is just wrong (I don't really see how you're
'automatically' doing anything here).
And the commit message is 'add struct opts'?
>
> Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
Please have a bit of a think about how you're approaching this, wait until the
THP code has actually been reworked (you can contribute patches to speed that
up), before even thinking of sending something like this again, and then send it
as an RFC.
> ---
> MAINTAINERS | 3 +
> include/linux/bpf_huge_memory.h | 52 ++++++++++
> include/linux/cgroup-defs.h | 1 +
> include/linux/huge_mm.h | 6 ++
> kernel/cgroup/cgroup.c | 2 +
> mm/Kconfig | 14 +++
> mm/Makefile | 1 +
> mm/bpf_huge_memory.c | 168 ++++++++++++++++++++++++++++++++
> 8 files changed, 247 insertions(+)
> create mode 100644 include/linux/bpf_huge_memory.h
> create mode 100644 mm/bpf_huge_memory.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index caaa0d6e6056..f1113eaa1193 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -4887,7 +4887,10 @@ M: Shakeel Butt <shakeel.butt@linux.dev>
> L: bpf@vger.kernel.org
> L: linux-mm@kvack.org
> S: Maintained
> +F: include/linux/bpf_huge_memory.h
> +F: mm/bpf_huge_memory.c
Err what??
You're adding THP-specific behaviour to 'BPF [MEMORY MANAGEMENT EXTENSIONS]'?
I'm sorry but what on earth possessed you to do that?
> F: mm/bpf_memcontrol.c
> +F: samples/bpf/mthp_ext.*
>
> BPF [MISC]
> L: bpf@vger.kernel.org
> diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
> new file mode 100644
> index 000000000000..ffda445c9572
> --- /dev/null
> +++ b/include/linux/bpf_huge_memory.h
> @@ -0,0 +1,52 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_HUGE_MEMORY_H
> +#define __BPF_HUGE_MEMORY_H
> +
> +#include <linux/cgroup-defs.h>
> +
> +/**
> + * struct bpf_mthp_ops - BPF callbacks for mTHP operations
> + * @mthp_choose: Choose the custom mTHP orders
> + *
> + * This structure defines the interface for BPF programs to customize
> + * mTHP behavior through struct_ops programs.
> + */
> +struct bpf_mthp_ops {
> + unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders);
> +};
> +
> +#ifdef CONFIG_BPF_TRANSPARENT_HUGEPAGE
> +/**
> + * bpf_mthp_choose - Choose the custom mTHP orders using bpf
> + * @mm: task mm_struct
> + * @orders: original orders
> + *
> + * Return suited mTHP orders.
> + */
> +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders);
> +
> +/**
> + * cgroup_bpf_set_mthp_ops - Set sub-cgroup mthp_ops to parent cgroup
> + * @cgrp: want to set mthp_ops of sub-cgroup
> + * @parent: parent cgroup
> + */
> +static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp,
> + struct cgroup *parent)
> +{
> + WRITE_ONCE(cgrp->mthp_ops, parent->mthp_ops);
> +}
> +#else
> +static inline unsigned long bpf_mthp_choose(struct mm_struct *mm,
> + unsigned long orders)
> +{
> + return orders;
> +}
> +static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp,
> + struct cgroup *parent)
> +{
> +}
> +#endif /* CONFIG_BPF_TRANSPARENT_HUGEPAGE */
These have the same interface flaws as the original THP BPF work. We don't know
whether we want BPF interfering on this decision and it impacts future
development on this.
> +
> +#endif /* __BPF_HUGE_MEMORY_H */
> +
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index f42563739d2e..78854d0e06ab 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -628,6 +628,7 @@ struct cgroup {
>
> #ifdef CONFIG_BPF_SYSCALL
> struct bpf_local_storage __rcu *bpf_cgrp_storage;
> + struct bpf_mthp_ops *mthp_ops;
> #endif
> #ifdef CONFIG_EXT_SUB_SCHED
> struct scx_sched __rcu *scx_sched;
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 127f9e1e7604..65da35fb0980 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -3,6 +3,7 @@
> #define _LINUX_HUGE_MM_H
>
> #include <linux/mm_types.h>
> +#include <linux/bpf_huge_memory.h>
>
> #include <linux/fs.h> /* only for vma_is_dax() */
> #include <linux/kobject.h>
> @@ -296,6 +297,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> enum tva_type type,
> unsigned long orders)
> {
> + /* The eBPF-specified orders overrides which order is selected. */
> + orders &= bpf_mthp_choose(vma->vm_mm, orders);
OK so every single time we call thp_vma_allowable_orders() we take an SRCU lock,
even if there aren't any BPF hooks?...!
And guess what, nobody in THP can do a damn thing to change it since you took
control of that away from us.
No dude.
> + if (!orders)
> + return 0;
> +
> /*
> * Optimization to check if required orders are enabled early. Only
> * forced collapse ignores sysfs configs.
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 43adc96c7f1a..1dbef3e8b179 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5836,6 +5836,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
> if (ret)
> goto out_stat_exit;
>
> + cgroup_bpf_set_mthp_ops(cgrp, parent);
> +
I'm not loving putting this in a fundamental cgroup function like this.
> for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
> cgrp->ancestors[tcgrp->level] = tcgrp;
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 27dc5b0139ba..be49bde783a7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -949,6 +949,20 @@ config NO_PAGE_MAPCOUNT
>
> EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config BPF_TRANSPARENT_HUGEPAGE
> + bool "BPF-based transparent hugepage (EXPERIMENTAL)"
Experimental means nothing.
> + depends on TRANSPARENT_HUGEPAGE && CGROUP_BPF
> + help
> + Using cgroup-bpf to customize mTHP size for different scenarios,
> + automatically select different mTHP sizes for different cgroups,
> + let's focus on making them truly transparent.
> +
> + This is an experimental feature, that might go away at any time,
> + Please do not rely any production environment.
That's not how BPF works.
> +
> + EXPERIMENTAL because the BPF interface is unstable and may be removed
> + at any time.
That's not how BPF works.
Did you even follow what was said on the last THP BPF series?
The interface is permanent, it doesn't matter what experimental labels you put
on it.
> +
> endif # TRANSPARENT_HUGEPAGE
>
> # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index 8ad2ab08244e..b474c21c3253 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -108,6 +108,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
> endif
> ifdef CONFIG_BPF_SYSCALL
> obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
> +obj-$(CONFIG_BPF_TRANSPARENT_HUGEPAGE) += bpf_huge_memory.o
> endif
> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> obj-$(CONFIG_GUP_TEST) += gup_test.o
> diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
> new file mode 100644
> index 000000000000..851c6ebe2933
> --- /dev/null
> +++ b/mm/bpf_huge_memory.c
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Huge memory related BPF code
Honestly reading this is making me a bit... annoyed :)
You seem to be trying to take control of THP away from the THP maintainers and
reviewers who work bloody hard for the community.
I'm sure you don't mean to, but it's not at all welcome!
I'll stop here, this series is a no.
> + *
> + * Author: Vernon Yang <yanglincheng@kylinos.cn>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/srcu.h>
> +
> +/* Protects cgrp->mthp_ops pointer for read and write. */
> +DEFINE_SRCU(mthp_bpf_srcu);
> +
> +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
> +{
> + struct cgroup *cgrp;
> + struct mem_cgroup *memcg;
> + struct bpf_mthp_ops *ops;
> + int idx;
> +
> + memcg = get_mem_cgroup_from_mm(mm);
> + if (!memcg)
> + return orders;
> +
> + cgrp = memcg->css.cgroup;
> +
> + idx = srcu_read_lock(&mthp_bpf_srcu);
> + ops = READ_ONCE(cgrp->mthp_ops);
> + if (unlikely(ops && ops->mthp_choose))
> + orders = ops->mthp_choose(cgrp, orders);
> + srcu_read_unlock(&mthp_bpf_srcu, idx);
> +
> + mem_cgroup_put(memcg);
> +
> + return orders;
> +}
> +
> +static int bpf_mthp_ops_btf_struct_access(struct bpf_verifier_log *log,
> + const struct bpf_reg_state *reg, int off, int size)
> +{
> + return -EACCES;
> +}
> +
> +static bool bpf_mthp_ops_is_valid_access(int off, int size, enum bpf_access_type type,
> + const struct bpf_prog *prog, struct bpf_insn_access_aux *info)
> +{
> + return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +const struct bpf_verifier_ops bpf_mthp_verifier_ops = {
> + .get_func_proto = bpf_base_func_proto,
> + .btf_struct_access = bpf_mthp_ops_btf_struct_access,
> + .is_valid_access = bpf_mthp_ops_is_valid_access,
> +};
> +
> +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> + struct bpf_mthp_ops *ops = kdata;
> + struct cgroup_subsys_state *child;
> + struct cgroup *cgrp;
> +
> + if (!link)
> + return -EOPNOTSUPP;
> +
> + cgrp = st_link->cgroup;
> + if (!cgrp)
> + return -EINVAL;
> +
> + cgroup_lock();
> + css_for_each_descendant_pre(child, &cgrp->self) {
> + if (READ_ONCE(child->cgroup->mthp_ops)) {
> + pr_warn("sub-cgroup has already registered.\n");
> + cgroup_unlock();
> + return -EBUSY;
> + }
> + }
> + css_for_each_descendant_pre(child, &cgrp->self)
> + WRITE_ONCE(child->cgroup->mthp_ops, ops);
> + cgroup_unlock();
> +
> + return 0;
> +}
> +
> +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> + struct cgroup_subsys_state *child;
> + struct cgroup *cgrp;
> +
> + if (!link)
> + return;
> +
> + cgrp = st_link->cgroup;
> + if (!cgrp)
> + return;
> +
> + cgroup_lock();
> + css_for_each_descendant_pre(child, &cgrp->self)
> + WRITE_ONCE(child->cgroup->mthp_ops, NULL);
> + cgroup_unlock();
> +
> + synchronize_srcu(&mthp_bpf_srcu);
> +}
> +
> +static int bpf_mthp_ops_check_member(const struct btf_type *t,
> + const struct btf_member *member,
> + const struct bpf_prog *prog)
> +{
> + u32 moff = __btf_member_bit_offset(t, member) / 8;
> +
> + switch (moff) {
> + case offsetof(struct bpf_mthp_ops, mthp_choose):
> + break;
> + default:
> + return -EINVAL;
> + }
> +
> + if (prog->sleepable)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static int bpf_mthp_ops_init_member(const struct btf_type *t,
> + const struct btf_member *member,
> + void *kdata, const void *udata)
> +{
> + return 0;
> +}
> +
> +static int bpf_mthp_ops_init(struct btf *btf)
> +{
> + return 0;
> +}
> +
> +static unsigned long cfi_mthp_choose(struct cgroup *cgrp, unsigned long orders)
> +{
> + return 0;
> +}
> +
> +static struct bpf_mthp_ops cfi_bpf_mthp_ops = {
> + .mthp_choose = cfi_mthp_choose,
> +};
> +
> +static struct bpf_struct_ops bso_bpf_mthp_ops = {
> + .verifier_ops = &bpf_mthp_verifier_ops,
> + .reg = bpf_mthp_ops_reg,
> + .unreg = bpf_mthp_ops_unreg,
> + .check_member = bpf_mthp_ops_check_member,
> + .init_member = bpf_mthp_ops_init_member,
> + .init = bpf_mthp_ops_init,
> + .name = "bpf_mthp_ops",
> + .owner = THIS_MODULE,
> + .cfi_stubs = &cfi_bpf_mthp_ops,
> +};
> +
> +static int __init bpf_huge_memory_init(void)
> +{
> + int err;
> +
> + err = register_bpf_struct_ops(&bso_bpf_mthp_ops, bpf_mthp_ops);
> + if (err)
> + pr_warn("Registration of bpf_mthp_ops failed, err %d\n", err);
> +
> + return err;
> +}
> +late_initcall(bpf_huge_memory_init);
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 22+ messages in thread

* Re: [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
2026-05-08 15:40 ` bot+bpf-ci
2026-05-08 15:57 ` Lorenzo Stoakes
@ 2026-05-08 20:54 ` David Hildenbrand (Arm)
2026-05-11 11:25 ` Lorenzo Stoakes
2026-05-08 22:29 ` sashiko-bot
3 siblings, 1 reply; 22+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 20:54 UTC (permalink / raw)
To: Vernon Yang, akpm, ljs, roman.gushchin, inwardvessel,
shakeel.butt, ast, daniel, surenb
Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
linux-kernel, linux-mm, bpf, Vernon Yang
>
> #include <linux/fs.h> /* only for vma_is_dax() */
> #include <linux/kobject.h>
> @@ -296,6 +297,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> enum tva_type type,
> unsigned long orders)
> {
> + /* The eBPF-specified orders overrides which order is selected. */
> + orders &= bpf_mthp_choose(vma->vm_mm, orders);
> + if (!orders)
> + return 0;
> +
There was some discussion around this in the past: where should we hook into
(e.g., deferred shrinker?), which information should we provide to the hook
(e.g., vma properties?).
We concluded mostly to "we don't know". I know that Rik van Riel wanted to look
into doing this properly, but seems like he got distracted :)
I assume there will be a lwn.net article covering the "BPF in MM" session we had
at LSF/MM just this week.
Conclusion: ABI stability is a headache.
The simplistic approach of deciding an order for the whole MM is very likely not
what we want.
--
Cheers,
David
^ permalink raw reply [flat|nested] 22+ messages in thread* Re: [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
2026-05-08 20:54 ` David Hildenbrand (Arm)
@ 2026-05-11 11:25 ` Lorenzo Stoakes
0 siblings, 0 replies; 22+ messages in thread
From: Lorenzo Stoakes @ 2026-05-11 11:25 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Vernon Yang, akpm, roman.gushchin, inwardvessel, shakeel.butt,
ast, daniel, surenb, tz2294, baohua, lance.yang, dev.jain,
laoar.shao, gutierrez.asier, linux-kernel, linux-mm, bpf,
Vernon Yang
On Fri, May 08, 2026 at 10:54:58PM +0200, David Hildenbrand (Arm) wrote:
> >
> > #include <linux/fs.h> /* only for vma_is_dax() */
> > #include <linux/kobject.h>
> > @@ -296,6 +297,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> > enum tva_type type,
> > unsigned long orders)
> > {
> > + /* The eBPF-specified orders overrides which order is selected. */
> > + orders &= bpf_mthp_choose(vma->vm_mm, orders);
> > + if (!orders)
> > + return 0;
> > +
>
> There was some discussion around this in the past: where should we hook into
> (e.g., deferred shrinker?), which information should we provide to the hook
> (e.g., vma properties?).
>
> We concluded mostly to "we don't know". I know that Rik van Riel wanted to look
> into doing this properly, but seems like he got distracted :)
>
> I assume there will be a lwn.net article covering the "BPF in MM" session we had
> at LSF/MM just this week.
>
> Conclusion: ABI stability is a headake.
>
> The simplistic approach of deciding an order for the whole MM is very likely not
> what we want.
Agreed with all points!
>
> --
> Cheers,
>
> David
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops
2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
` (2 preceding siblings ...)
2026-05-08 20:54 ` David Hildenbrand (Arm)
@ 2026-05-08 22:29 ` sashiko-bot
3 siblings, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-05-08 22:29 UTC (permalink / raw)
To: Vernon Yang; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [Critical] BPF architectural layering violation resulting in a Use-After-Free risk.
- [High] Broken cgroup inheritance prevents overriding parent policies.
- [High] Unconditional atomic operations and SRCU locking on the page fault hot path.
- [Low] Build failures due to incomplete types and missing header includes.
--
commit 5151cfb43b99be8eef5398ab31786c76a122dc79
Author: Vernon Yang <yanglincheng@kylinos.cn>
mm: introduce bpf_mthp_ops struct ops
This commit introduces a new BPF struct_ops called bpf_mthp_ops to allow eBPF
programs to customize the sizes of multi-size THP (mTHP) for different
cgroups. It adds bpf_mthp_choose() to the page fault path to retrieve the
BPF-specified orders.
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1f084ee71443e..7db83ca81e6e7 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5836,6 +5836,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
> if (ret)
> goto out_stat_exit;
>
> + cgroup_bpf_set_mthp_ops(cgrp, parent);
Does this file need to include <linux/bpf_huge_memory.h> for this function
call? This might cause a build failure due to a missing declaration.
> +
> for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp))
> cgrp->ancestors[tcgrp->level] = tcgrp;
[ ... ]
> diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
> new file mode 100644
> index 0000000000000..851c6ebe29334
> --- /dev/null
> +++ b/mm/bpf_huge_memory.c
> @@ -0,0 +1,168 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
[ ... ]
> +#include <linux/bpf.h>
> +#include <linux/srcu.h>
Does this file also need to include <linux/memcontrol.h> and <linux/cgroup.h>?
Functions like get_mem_cgroup_from_mm() and css_for_each_descendant_pre() are
used below, which might lead to build failures.
> +
> +/* Protects cgrp->mthp_ops pointer for read and write. */
> +DEFINE_SRCU(mthp_bpf_srcu);
> +
> +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
> +{
> + struct cgroup *cgrp;
> + struct mem_cgroup *memcg;
> + struct bpf_mthp_ops *ops;
> + int idx;
> +
> + memcg = get_mem_cgroup_from_mm(mm);
Since thp_vma_allowable_orders() calls this function unconditionally on every
THP allocation, does executing get_mem_cgroup_from_mm() and modifying atomic
reference counts on the page fault hot path introduce significant overhead?
Could a static key be used to bypass these atomic operations and the SRCU
locking below when the BPF feature is inactive?
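The static-key pattern being suggested would look roughly like this (illustrative sketch only; `mthp_bpf_active` and `__bpf_mthp_choose` are made-up names, not part of the posted series):

```c
/* mm/bpf_huge_memory.c -- sketch of a patched-out fast path */
static DEFINE_STATIC_KEY_FALSE(mthp_bpf_active);

unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
{
	/* Near-zero-cost branch while no BPF program is registered. */
	if (!static_branch_unlikely(&mthp_bpf_active))
		return orders;

	/* Slow path: get_mem_cgroup_from_mm(), SRCU read lock, callback. */
	return __bpf_mthp_choose(mm, orders);
}
```

bpf_mthp_ops_reg() would then do static_branch_inc(&mthp_bpf_active) on success, and bpf_mthp_ops_unreg() the matching static_branch_dec().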
> + if (!memcg)
> + return orders;
> +
> + cgrp = memcg->css.cgroup;
> +
> + idx = srcu_read_lock(&mthp_bpf_srcu);
> + ops = READ_ONCE(cgrp->mthp_ops);
> + if (unlikely(ops && ops->mthp_choose))
> + orders = ops->mthp_choose(cgrp, orders);
> + srcu_read_unlock(&mthp_bpf_srcu, idx);
> +
> + mem_cgroup_put(memcg);
> +
> + return orders;
> +}
[ ... ]
> +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> + struct bpf_mthp_ops *ops = kdata;
> + struct cgroup_subsys_state *child;
> + struct cgroup *cgrp;
> +
> + if (!link)
> + return -EOPNOTSUPP;
> +
> + cgrp = st_link->cgroup;
Can this code successfully compile? struct bpf_struct_ops_link appears to be an
incomplete type defined privately in kernel/bpf/bpf_struct_ops.c, and it lacks
a cgroup field.
Additionally, does bpf_mthp_ops_reg() need to acquire a reference count on the
target cgroup using cgroup_get()? If the cgroup is deleted while the link
remains active, could st_link->cgroup become a dangling pointer and lead to a
use-after-free when bpf_mthp_ops_unreg() is later called?
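If the field existed, the conventional answer to the lifetime question would be to pin the cgroup for as long as the link lives (sketch only; the posted series does neither):

```c
/* in bpf_mthp_ops_reg(), after validating cgrp: */
cgroup_get(cgrp);	/* link now holds a reference on the cgroup */

/* in bpf_mthp_ops_unreg(), after clearing the ops pointers: */
cgroup_put(cgrp);	/* drop the reference taken at reg time */
```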
> + if (!cgrp)
> + return -EINVAL;
> +
> + cgroup_lock();
> + css_for_each_descendant_pre(child, &cgrp->self) {
> + if (READ_ONCE(child->cgroup->mthp_ops)) {
> + pr_warn("sub-cgroup has already registered.\n");
> + cgroup_unlock();
> + return -EBUSY;
> + }
> + }
Since cgroup_create() copies parent->mthp_ops directly to the child cgroup,
wouldn't all descendants inherently have mthp_ops set if their parent does?
Does this block users from registering an overriding BPF program on the child
cgroup, or block the parent if a child registers first?
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508150055.680136-1-vernon2gm@gmail.com?part=3
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 4/4] samples: bpf: add mthp_ext
2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
` (2 preceding siblings ...)
2026-05-08 15:00 ` [PATCH v2 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
@ 2026-05-08 15:00 ` Vernon Yang
2026-05-08 15:40 ` bot+bpf-ci
2026-05-08 22:52 ` sashiko-bot
2026-05-08 15:14 ` [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Lorenzo Stoakes
2026-05-08 16:00 ` Pedro Falcato
5 siblings, 2 replies; 22+ messages in thread
From: Vernon Yang @ 2026-05-08 15:00 UTC (permalink / raw)
To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb
Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
linux-kernel, linux-mm, bpf, Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Design mthp_ext case to address real workload issues.
The main functions of the mthp_ext are as follows:
- When a sub-cgroup is under high memory pressure (by default, 100ms of
full stall within a 1s window), it will automatically fall back to using 4KB.
- When the anon+shmem memory usage of a sub-cgroup falls below the minimum
memory (default 16MB), small-memory processes will automatically
fall back to using 4KB.
- Under normal conditions, when there is no memory pressure and the
anon+shmem memory usage exceeds the minimum memory, all mTHP sizes
shall be utilized by the kernel.
- Monitor the root-cgroup (/sys/fs/cgroup) directory by default, with
support for specifying any cgroup directory.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn>
---
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 7 +-
samples/bpf/mthp_ext.bpf.c | 148 ++++++++++++++++
samples/bpf/mthp_ext.c | 339 +++++++++++++++++++++++++++++++++++++
samples/bpf/mthp_ext.h | 30 ++++
5 files changed, 524 insertions(+), 1 deletion(-)
create mode 100644 samples/bpf/mthp_ext.bpf.c
create mode 100644 samples/bpf/mthp_ext.c
create mode 100644 samples/bpf/mthp_ext.h
diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0002cd359fb1..2a73581876b4 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -49,3 +49,4 @@ iperf.*
/vmlinux.h
/bpftool/
/libbpf/
+mthp_ext
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95a4fa1f1e44..357c7d1c45ef 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -37,6 +37,7 @@ tprogs-y += xdp_fwd
tprogs-y += task_fd_query
tprogs-y += ibumad
tprogs-y += hbm
+tprogs-y += mthp_ext
# Libbpf dependencies
LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -122,6 +123,7 @@ always-y += task_fd_query_kern.o
always-y += ibumad_kern.o
always-y += hbm_out_kern.o
always-y += hbm_edt_kern.o
+always-y += mthp_ext.bpf.o
COMMON_CFLAGS = $(TPROGS_USER_CFLAGS)
TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -289,6 +291,8 @@ $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
$(obj)/hbm.o: $(src)/hbm.h
$(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
+mthp_ext: $(obj)/mthp_ext.skel.h
+
# Override includes for xdp_sample_user.o because $(srctree)/usr/include in
# TPROGS_CFLAGS causes conflicts
XDP_SAMPLE_CFLAGS += -Wall -O2 \
@@ -347,10 +351,11 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
-c $(filter %.bpf.c,$^) -o $@
-LINKED_SKELS := xdp_router_ipv4.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h mthp_ext.skel.h
clean-files += $(LINKED_SKELS)
xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
+mthp_ext.skel.h-deps := mthp_ext.bpf.o
LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c
new file mode 100644
index 000000000000..3524dc45fda4
--- /dev/null
+++ b/samples/bpf/mthp_ext.bpf.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include "mthp_ext.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include <vdso/bits.h>
+
+struct mem_info {
+ unsigned long long stall;
+ unsigned int order;
+};
+
+struct {
+ __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct mem_info);
+} cgrp_storage SEC(".maps");
+
+struct {
+ __uint(type, BPF_MAP_TYPE_RINGBUF);
+ __uint(max_entries, 256 * 1024);
+} events SEC(".maps");
+
+struct config_local configs;
+
+/*
+ * mthp_choose_impl - Choose the custom mTHP orders, reading the order from
+ * cgrp_storage, which is adjusted by cgroup_scan().
+ * @cgrp: control group
+ * @orders: original orders
+ *
+ * Return suited mTHP orders.
+ */
+SEC("struct_ops/mthp_choose")
+unsigned long BPF_PROG(mthp_choose_impl, struct cgroup *cgrp, unsigned long orders)
+{
+ struct mem_info *info;
+ unsigned int order;
+
+ if (configs.fixed) {
+ order = configs.init_order;
+ goto out;
+ }
+
+ info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0, 0);
+ if (!info)
+ return orders;
+
+ order = info->order;
+out:
+ if (!order)
+ return 0;
+
+ orders &= BIT(order + 1) - 1;
+ return orders;
+}
+
+SEC(".struct_ops.link")
+struct bpf_mthp_ops mthp_ops = {
+ .mthp_choose = (void *)mthp_choose_impl,
+};
+
+/* backport from kernel/cgroup/cgroup.c */
+static bool cgroup_has_tasks(struct cgroup *cgrp)
+{
+ return cgrp->nr_populated_csets;
+}
+
+/*
+ * cgroup_scan - scan all descendant cgroups under root cgroup.
+ *
+ * 1. When the memory usage of the sub-cgroup falls below the <min> threshold,
+ * it will automatically fall back to using 4KB size; otherwise, it will
+ * use all mTHP sizes.
+ * 2. When memory.pressure stall time of the sub-cgroup exceeds <threshold>,
+ * it will automatically fall back to using 4KB size; otherwise, it will
+ * use all mTHP sizes.
+ *
+ * Returning 1 terminates the iteration loop; returning 0 advances to the
+ * next sub-cgroup.
+ */
+SEC("iter.s/cgroup")
+int cgroup_scan(struct bpf_iter__cgroup *ctx)
+{
+ struct cgroup *cgrp = ctx->cgroup;
+ struct mem_cgroup *memcg;
+ struct mem_info *info;
+ struct alert_event *e;
+ unsigned long curr_mem;
+ unsigned long long curr_stall;
+ unsigned long long delta;
+
+ if (!cgrp)
+ return 1;
+
+ if (!cgroup_has_tasks(cgrp))
+ return 0;
+
+ info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!info)
+ return 0;
+
+ memcg = bpf_get_mem_cgroup(&cgrp->self);
+ if (!memcg)
+ return 0;
+
+ bpf_cgroup_flush_stats(cgrp);
+ curr_stall = bpf_cgroup_stall(cgrp, PSI_MEM_FULL);
+ if (!info->stall) {
+ info->order = configs.init_order;
+ goto UPDATE;
+ }
+ delta = curr_stall - info->stall;
+ bpf_mem_cgroup_flush_stats(memcg);
+ curr_mem = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED) +
+ bpf_mem_cgroup_page_state(memcg, NR_SHMEM);
+ if ((curr_mem && curr_mem < FROM_MB(configs.min_mem)) ||
+ delta >= configs.threshold)
+ info->order = 0;
+ else
+ info->order = PMD_ORDER;
+
+ if (configs.debug) {
+ e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
+ if (e) {
+ e->prev_stall = info->stall;
+ e->curr_stall = curr_stall;
+ e->delta = delta;
+ e->mem = curr_mem;
+ e->order = info->order;
+ bpf_probe_read_kernel_str(e->name, sizeof(e->name),
+ cgrp->kn->name);
+ bpf_ringbuf_submit(e, 0);
+ }
+ }
+
+UPDATE:
+ info->stall = curr_stall;
+ bpf_put_mem_cgroup(memcg);
+
+ return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c
new file mode 100644
index 000000000000..120c331ff26a
--- /dev/null
+++ b/samples/bpf/mthp_ext.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <signal.h>
+#include <time.h>
+#include <stdbool.h>
+#include <getopt.h>
+#include <sys/epoll.h>
+#include <sys/stat.h>
+#include <linux/limits.h>
+#include <linux/bpf.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include "mthp_ext.h"
+#include "mthp_ext.skel.h"
+
+#define DEFAULT_ROOT "/sys/fs/cgroup"
+#define DEFAULT_THRESHOLD_MS 100UL
+#define DEFAULT_INTERVAL_MS 1000UL
+#define DEFAULT_ORDER PMD_ORDER
+#define DEFAULT_MIN_MEM 16
+
+static bool exiting;
+
+static void usage(const char *name)
+{
+ fprintf(stderr,
+ "Usage: %s [OPTIONS]\n\n"
+ "Monitor specified cgroup, adjust mTHP size via cgroup_bpf.\n\n"
+ "Currently supports fixed mTHP size and automatic mTHP size adjustment.\n"
+ "By default, it monitors the entire cgroup and automatically\n"
+ "adjusts mTHP size within the specified time window <interval>.\n"
+ "1. When the memory size of the sub-cgroup falls below\n"
+ " the <min> threshold, it will automatically fall back to\n"
+ " using 4KB size; otherwise, it will use all mTHP sizes.\n"
+ "2. When memory.pressure stall time of the sub-cgroup exceeds\n"
+ " <threshold>, it will automatically fall back to using 4KB\n"
+ " size; otherwise, it will use all mTHP sizes.\n\n"
+ "Options:\n"
+ " -r, --root=PATH Root cgroup path (default: /sys/fs/cgroup)\n"
+ " -t, --threshold=MS threshold in ms (default: %lu)\n"
+ " -i, --interval=MS interval in ms (default: %lu)\n"
+ " -o, --order=NR Initial mthp order (default: %d)\n"
+ " -m, --min=MB Minimum memory size for mTHP (default: %d)\n"
+ " -f, --fixed Use fixed order, disable auto-adjustment\n"
+ " -d, --debug Enable debug output\n"
+ " -h, --help Show this help\n",
+ name, DEFAULT_THRESHOLD_MS, DEFAULT_INTERVAL_MS, DEFAULT_ORDER,
+ DEFAULT_MIN_MEM);
+}
+
+static void sig_handler(int sig)
+{
+ exiting = true;
+}
+
+static int setup_psi_trigger(const char *cgroup_path, const char *type,
+ unsigned long stall_us, unsigned long window_us)
+{
+ char path[PATH_MAX];
+ char trigger[128];
+ int fd, nr;
+
+ snprintf(path, sizeof(path), "%s/memory.pressure", cgroup_path);
+ fd = open(path, O_RDWR | O_NONBLOCK);
+ if (fd < 0) {
+ fprintf(stderr, "ERROR: open PSI file failed\n");
+ return -errno;
+ }
+
+ nr = snprintf(trigger, sizeof(trigger), "%s %lu %lu",
+ type, stall_us, window_us);
+ if (write(fd, trigger, nr) < 0) {
+ fprintf(stderr, "ERROR: write PSI trigger failed\n");
+ close(fd);
+ return -errno;
+ }
+
+ return fd;
+}
+
+static int trigger_scan(struct bpf_link *iter_link)
+{
+ char buf[256];
+ int fd;
+
+ fd = bpf_iter_create(bpf_link__fd(iter_link));
+ if (fd < 0) {
+ fprintf(stderr, "ERROR: bpf_iter_create failed: %s\n",
+ strerror(errno));
+ return -1;
+ }
+
+ /* Read to trigger the iter program execution */
+ while (read(fd, buf, sizeof(buf)) > 0)
+ ;
+
+ close(fd);
+ return 0;
+}
+
+static void *monitor_thread(int psi_fd, struct config_local *configs,
+ struct bpf_link *iter_link, struct ring_buffer *rb)
+{
+ struct epoll_event e;
+ int epoll_fd;
+ int nfds;
+
+ epoll_fd = epoll_create1(0);
+ if (epoll_fd < 0) {
+ fprintf(stderr, "ERROR: epoll_create1 failed\n");
+ return NULL;
+ }
+
+ e.events = EPOLLPRI;
+ e.data.fd = psi_fd;
+ if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, psi_fd, &e)) {
+ fprintf(stderr, "ERROR: epoll_ctl failed\n");
+ goto CLOSE;
+ }
+
+ /* First initialization */
+ trigger_scan(iter_link);
+
+ /* Auto adjustment */
+ while (!exiting) {
+ nfds = epoll_wait(epoll_fd, &e, 1, configs->interval * 2);
+ trigger_scan(iter_link);
+
+ if (configs->debug) {
+ printf("PSI: memory pressure %s\n", nfds ? "high" : "low");
+ ring_buffer__poll(rb, 0);
+ }
+ }
+
+CLOSE:
+ close(epoll_fd);
+ return NULL;
+}
+
+static int handle_event(void *ctx, void *data, size_t len)
+{
+ struct alert_event *e = data;
+
+ printf("cgroup %s: stall %llu -> %llu (+%llu), mem %luMB, mthp order=%d\n",
+ e->name[0] ? e->name : "/",
+ e->prev_stall, e->curr_stall, e->delta, TO_MB(e->mem), e->order);
+
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ const char *root_path = DEFAULT_ROOT;
+ unsigned long threshold = DEFAULT_THRESHOLD_MS;
+ unsigned long interval = DEFAULT_INTERVAL_MS;
+ unsigned int init_order = DEFAULT_ORDER;
+ unsigned int min_mem = DEFAULT_MIN_MEM;
+ bool fixed = false;
+ bool debug = false;
+ struct mthp_ext *skel;
+ struct bpf_link *iter_link;
+ struct bpf_link *ops_link;
+ struct ring_buffer *rb;
+ int root_fd;
+ int psi_fd;
+ int err = 0;
+ int opt;
+
+ static struct option long_options[] = {
+ {"root", required_argument, 0, 'r'},
+ {"threshold", required_argument, 0, 't'},
+ {"interval", required_argument, 0, 'i'},
+ {"order", required_argument, 0, 'o'},
+ {"min", required_argument, 0, 'm'},
+ {"fixed", no_argument, 0, 'f'},
+ {"debug", no_argument, 0, 'd'},
+ {"help", no_argument, 0, 'h'},
+ {0, 0, 0, 0}
+ };
+
+ while ((opt = getopt_long(argc, argv, "r:t:i:o:m:fdh",
+ long_options, NULL)) != -1) {
+ switch (opt) {
+ case 'r':
+ root_path = optarg;
+ break;
+ case 't':
+ threshold = strtoul(optarg, NULL, 10);
+ break;
+ case 'i':
+ interval = strtoul(optarg, NULL, 10);
+ break;
+ case 'o':
+ init_order = min(strtoul(optarg, NULL, 10), PMD_ORDER);
+ break;
+ case 'm':
+ min_mem = strtoul(optarg, NULL, 10);
+ break;
+ case 'f':
+ fixed = true;
+ break;
+ case 'd':
+ debug = true;
+ break;
+ case 'h':
+ usage(argv[0]);
+ return 0;
+ default:
+ usage(argv[0]);
+ return -EINVAL;
+ }
+ }
+
+ if (!threshold || !interval) {
+ fprintf(stderr, "ERROR: threshold and interval must be > 0\n");
+ usage(argv[0]);
+ return -EINVAL;
+ }
+
+ signal(SIGINT, sig_handler);
+ signal(SIGTERM, sig_handler);
+
+ root_fd = open(root_path, O_RDONLY);
+ if (root_fd < 0) {
+ fprintf(stderr, "ERROR: open '%s' failed: %s\n",
+ root_path, strerror(errno));
+ return -errno;
+ }
+
+ skel = mthp_ext__open();
+ if (!skel) {
+ fprintf(stderr, "ERROR: failed to open BPF skeleton\n");
+ err = -ENOMEM;
+ goto open_skel_fail;
+ }
+
+ skel->bss->configs.threshold = threshold;
+ skel->bss->configs.interval = interval;
+ skel->bss->configs.init_order = init_order;
+ skel->bss->configs.min_mem = min_mem;
+ skel->bss->configs.fixed = fixed;
+ skel->bss->configs.debug = debug;
+
+ err = mthp_ext__load(skel);
+ if (err) {
+ fprintf(stderr, "ERROR: failed to load BPF program: %d\n", err);
+ goto load_skel_fail;
+ }
+
+ /* Attach struct_ops to root cgroup for mthp_choose */
+ DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+ opts.flags = BPF_F_CGROUP_FD;
+ opts.target_fd = root_fd;
+ ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts);
+ err = libbpf_get_error(ops_link);
+ if (err) {
+ fprintf(stderr, "ERROR: attach struct_ops failed: %d\n", err);
+ ops_link = NULL;
+ goto attach_opts_fail;
+ }
+
+ printf("Monitoring : %s\n"
+ "threshold : %lums\n"
+ "Interval : %lums\n"
+ "Initial order : %d%s\n"
+ "min memory : %dMB\n"
+ "Debug : %s\n"
+ "Press Ctrl+C to exit.\n\n",
+ root_path, threshold, interval, init_order,
+ fixed ? " (fixed)" : " (auto)", min_mem,
+ debug ? "on" : "off");
+
+ if (fixed) {
+ while (!exiting)
+ usleep(interval * 1000);
+ goto exit_fixed;
+ }
+
+ /* Auto adjustment, attach cgroup iter for scanning root + descendants */
+ DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, iter_opts);
+ union bpf_iter_link_info linfo = {
+ .cgroup.cgroup_fd = root_fd,
+ .cgroup.order = BPF_CGROUP_ITER_DESCENDANTS_PRE,
+ };
+ iter_opts.link_info = &linfo;
+ iter_opts.link_info_len = sizeof(linfo);
+ iter_link = bpf_program__attach_iter(skel->progs.cgroup_scan, &iter_opts);
+ err = libbpf_get_error(iter_link);
+ if (err) {
+ fprintf(stderr, "ERROR: attach cgroup iter failed: %d\n", err);
+ iter_link = NULL;
+ goto attach_iter_fail;
+ }
+
+ /* Set up ring buffer for receiving alerts */
+ rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
+ handle_event, NULL, NULL);
+ if (!rb) {
+ fprintf(stderr, "ERROR: failed to create ring buffer\n");
+ err = -ENOMEM;
+ goto rb_fail;
+ }
+
+
+ psi_fd = setup_psi_trigger(root_path, "some", threshold * 1000,
+ interval * 1000);
+ if (psi_fd < 0) {
+ fprintf(stderr, "ERROR: PSI trigger setup failed\n");
+ err = -EINVAL;
+ goto psi_setup_fail;
+ }
+
+ monitor_thread(psi_fd, &skel->bss->configs, iter_link, rb);
+
+ close(psi_fd);
+psi_setup_fail:
+ ring_buffer__free(rb);
+rb_fail:
+ bpf_link__destroy(iter_link);
+exit_fixed:
+attach_iter_fail:
+ bpf_link__destroy(ops_link);
+attach_opts_fail:
+load_skel_fail:
+ mthp_ext__destroy(skel);
+open_skel_fail:
+ close(root_fd);
+
+ printf("\nExiting...\n");
+
+ return err;
+}
diff --git a/samples/bpf/mthp_ext.h b/samples/bpf/mthp_ext.h
new file mode 100644
index 000000000000..e29d80aa15bf
--- /dev/null
+++ b/samples/bpf/mthp_ext.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __MTHP_EXT_H__
+#define __MTHP_EXT_H__
+
+#define CGROUP_NAME_LEN 128
+#define PMD_ORDER 9
+#define min(a, b) ((a) < (b) ? (a) : (b))
+#define FROM_MB(s) ((s) * 1024UL * 1024UL)
+#define TO_MB(s) ((s) / 1024UL / 1024UL)
+
+struct config_local {
+ unsigned long threshold;
+ unsigned long interval;
+ unsigned int init_order;
+ unsigned int min_mem;
+ bool fixed;
+ bool debug;
+};
+
+struct alert_event {
+ unsigned long long prev_stall;
+ unsigned long long curr_stall;
+ unsigned long long delta;
+ unsigned long mem;
+ unsigned int order;
+ char name[CGROUP_NAME_LEN];
+};
+
+#endif /* __MTHP_EXT_H__ */
--
2.53.0
^ permalink raw reply related [flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] samples: bpf: add mthp_ext
2026-05-08 15:00 ` [PATCH v2 4/4] samples: bpf: add mthp_ext Vernon Yang
@ 2026-05-08 15:40 ` bot+bpf-ci
2026-05-08 22:52 ` sashiko-bot
1 sibling, 0 replies; 22+ messages in thread
From: bot+bpf-ci @ 2026-05-08 15:40 UTC (permalink / raw)
To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel,
shakeel.butt, ast, daniel, surenb
Cc: tz2294, baohua, lance.yang, dev.jain, laoar.shao, gutierrez.asier,
linux-kernel, linux-mm, bpf, yanglincheng, ast, andrii, daniel,
martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 2847 bytes --]
> diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
> index 0002cd359fb1..2a73581876b4 100644
> --- a/samples/bpf/.gitignore
> +++ b/samples/bpf/.gitignore
[ ... ]
> diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c
> new file mode 100644
> index 000000000000..120c331ff26a
> --- /dev/null
> +++ b/samples/bpf/mthp_ext.c
[ ... ]
> @@ -0,0 +1,339 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include <signal.h>
> +#include <time.h>
> +#include <stdbool.h>
> +#include <getopt.h>
> +#include <sys/epoll.h>
> +#include <sys/stat.h>
> +#include <linux/limits.h>
> +#include <linux/bpf.h>
> +#include <bpf/bpf.h>
> +#include <bpf/libbpf.h>
> +#include "mthp_ext.h"
> +#include "mthp_ext.skel.h"
[ ... ]
> +int main(int argc, char **argv)
> +{
> + const char *root_path = DEFAULT_ROOT;
> + unsigned long threshold = DEFAULT_THRESHOLD_MS;
> + unsigned long interval = DEFAULT_INTERVAL_MS;
> + unsigned int init_order = DEFAULT_ORDER;
> + unsigned int min_mem = DEFAULT_MIN_MEM;
> + bool fixed = false;
> + bool debug = false;
> + struct mthp_ext *skel;
> + struct bpf_link *iter_link;
> + struct bpf_link *ops_link;
> + struct ring_buffer *rb;
> + int root_fd;
> + int psi_fd;
> + int err = 0;
> + int opt;
[ ... ]
> + /* Attach struct_ops to root cgroup for mthp_choose */
> + DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
> + opts.flags = BPF_F_CGROUP_FD;
> + opts.target_fd = root_fd;
> + ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts);
> + err = libbpf_get_error(ops_link);
Will mthp_ext compile with these symbols? bot+bpf-ci@kernel.org noted in
v1 that bpf_struct_ops_opts, BPF_F_CGROUP_FD, and
bpf_map__attach_struct_ops_opts() do not exist anywhere in the kernel
source tree:
https://lore.kernel.org/bpf/c13f99d42c434ec73ec1e8cf1141047146ab359e60e56e78e1c32554d6e01e9b@mail.kernel.org/
The struct bpf_struct_ops_opts type is not defined in
tools/lib/bpf/libbpf.h.
The BPF_F_CGROUP_FD flag does not appear in include/uapi/linux/bpf.h or
any other header.
The bpf_map__attach_struct_ops_opts() function is not declared in
libbpf.h or exported in tools/lib/bpf/libbpf.map.
You mentioned in your response that this depends on patches from the 'mm:
BPF OOM' series (PATCH#2), but this dependency was not addressed in v2 -
the same non-existent API calls remain in the code.
Should this series explicitly list the dependency on the 'mm: BPF OOM'
PATCH#2 in the cover letter?
[ ... ]
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25563319416
^ permalink raw reply [flat|nested] 22+ messages in thread

* Re: [PATCH v2 4/4] samples: bpf: add mthp_ext
2026-05-08 15:00 ` [PATCH v2 4/4] samples: bpf: add mthp_ext Vernon Yang
2026-05-08 15:40 ` bot+bpf-ci
@ 2026-05-08 22:52 ` sashiko-bot
1 sibling, 0 replies; 22+ messages in thread
From: sashiko-bot @ 2026-05-08 22:52 UTC (permalink / raw)
To: Vernon Yang; +Cc: bpf
Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] Small memory processes bypass the 4KB fallback rule due to a faulty condition that excludes zero memory usage.
- [High] The cgroup memory check is indefinitely bypassed for cgroups that do not experience memory pressure because a stall value of `0` is treated as uninitialized.
- [Medium] The BPF program submits uninitialized memory to the ring buffer if reading the cgroup name fails, which can crash the userspace handler.
- [Medium] The global `exiting` flag is accessed locklessly without the `volatile sig_atomic_t` qualifier, potentially leading to an infinite loop.
--
commit d747c2eece6f13195b37f744a5f3e9fd6d9a2c95
Author: Vernon Yang <yanglincheng@kylinos.cn>
samples: bpf: add mthp_ext
Design mthp_ext case to address real workload issues by adjusting mTHP size
via a cgroup BPF program.
> diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c
> new file mode 100644
> --- /dev/null
> +++ b/samples/bpf/mthp_ext.bpf.c
[ ... ]
> +SEC("iter.s/cgroup")
> +int cgroup_scan(struct bpf_iter__cgroup *ctx)
> +{
[ ... ]
> + bpf_cgroup_flush_stats(cgrp);
> + curr_stall = bpf_cgroup_stall(cgrp, PSI_MEM_FULL);
> + if (!info->stall) {
Does this logic unintentionally bypass the memory check for cgroups without
memory pressure?
bpf_cgroup_stall() can return exactly 0 for cgroups that have not experienced
any memory pressure. If curr_stall is 0, info->stall will be set to 0.
On subsequent scans, !info->stall will evaluate to true again, continuously
resetting the order to PMD_ORDER and skipping the memory size evaluation
below.
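One possible shape for the fix, sketched as plain userspace C (the struct field `initialized` and the helper `scan_once()` are hypothetical names, not part of the patch): track first-scan initialization with an explicit flag instead of overloading a stall value of 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical mirror of the per-cgroup state, with an explicit
 * initialized flag so a legitimate stall value of 0 is not treated
 * as "first scan" on every iteration. */
struct mem_info {
	unsigned long long stall;
	bool initialized;
};

/* Returns true when the scan takes the one-time init path. */
static bool scan_once(struct mem_info *info, unsigned long long curr_stall)
{
	if (!info->initialized) {
		info->initialized = true;
		info->stall = curr_stall;
		return true;		/* init path: skip memory/pressure checks */
	}
	info->stall = curr_stall;	/* normal path: checks would run here */
	return false;
}
```

With this shape, a cgroup whose stall counter stays at 0 takes the init path exactly once and is evaluated normally on every later scan.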
> + info->order = configs.init_order;
> + goto UPDATE;
> + }
> + delta = curr_stall - info->stall;
> + bpf_mem_cgroup_flush_stats(memcg);
> + curr_mem = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED) +
> + bpf_mem_cgroup_page_state(memcg, NR_SHMEM);
> + if ((curr_mem && curr_mem < FROM_MB(configs.min_mem)) ||
Could this condition allow new processes to bypass the 4KB fallback rule?
When a process starts and has not faulted in any anonymous or shmem memory,
its memory usage is 0. The curr_mem && check causes this to evaluate to false,
falling through to the else branch and assigning PMD_ORDER.
This seems to allow small or starting processes to allocate 2MB huge pages
for their first allocations instead of falling back to 4KB.
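A minimal sketch of the suggested fix (the helper `pick_order()` is a hypothetical name used only to isolate the condition): dropping the `curr_mem &&` guard makes zero anon+shmem usage fall back to 4KB like any other below-threshold usage.

```c
#include <assert.h>

#define MB (1024UL * 1024UL)

/* Hypothetical helper isolating the order-selection condition: without
 * the curr_mem && guard, a cgroup with zero memory usage also falls
 * back to 4KB pages instead of getting PMD_ORDER. */
static unsigned int pick_order(unsigned long curr_mem, unsigned long min_bytes,
			       unsigned long long delta,
			       unsigned long long threshold,
			       unsigned int pmd_order)
{
	if (curr_mem < min_bytes || delta >= threshold)
		return 0;		/* fall back to 4KB */
	return pmd_order;		/* allow all mTHP sizes */
}
```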
> + delta >= configs.threshold)
> + info->order = 0;
> + else
> + info->order = PMD_ORDER;
> +
> + if (configs.debug) {
> + e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
> + if (e) {
> + e->prev_stall = info->stall;
> + e->curr_stall = curr_stall;
> + e->delta = delta;
> + e->mem = curr_mem;
> + e->order = info->order;
> + bpf_probe_read_kernel_str(e->name, sizeof(e->name),
> + cgrp->kn->name);
Is it possible for e->name to contain uninitialized garbage if
bpf_probe_read_kernel_str() fails?
bpf_ringbuf_reserve() does not zero-initialize the memory. If the string
read fails, e->name might lack a null terminator. This could cause a
segmentation fault in the userspace daemon when it processes the event using
printf.
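The cleanest fix is on the BPF side (check the return value of bpf_probe_read_kernel_str() and write a terminator on failure), but a defensive userspace variant is also possible. The sketch below (handle_event_safe() is a hypothetical name) forces termination before printing, since bpf_ringbuf_reserve() does not zero the reserved memory:

```c
#include <stdio.h>
#include <string.h>

#define CGROUP_NAME_LEN 128

/* Mirrors the event layout in samples/bpf/mthp_ext.h. */
struct alert_event {
	unsigned long long prev_stall;
	unsigned long long curr_stall;
	unsigned long long delta;
	unsigned long mem;
	unsigned int order;
	char name[CGROUP_NAME_LEN];
};

/* Defensive variant of handle_event(): never trust that name[] is
 * null-terminated; a missing terminator would make printf overrun
 * the buffer. */
static int handle_event_safe(void *ctx, void *data, size_t len)
{
	struct alert_event *e = data;

	if (len < sizeof(*e))
		return 0;			/* truncated event, drop it */
	e->name[CGROUP_NAME_LEN - 1] = '\0';	/* force termination */
	printf("cgroup %s: mthp order=%u\n",
	       e->name[0] ? e->name : "/", e->order);
	return 0;
}
```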
> + bpf_ringbuf_submit(e, 0);
> + }
> + }
[ ... ]
> diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c
> new file mode 100644
> --- /dev/null
> +++ b/samples/bpf/mthp_ext.c
[ ... ]
> +static bool exiting;
Should this be declared as volatile sig_atomic_t?
Since exiting is modified asynchronously in the signal handler and read in the
while (!exiting) loops, an optimizing compiler might hoist the read out of
the loop. This could potentially cause the program to hang indefinitely
instead of exiting on a signal.
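A minimal sketch of the suggested declaration (the run_loop() wrapper is hypothetical, added only to exercise the flag): volatile forces the compiler to reload the flag on every loop iteration, and sig_atomic_t makes the store in the async handler well-defined.

```c
#include <signal.h>

/* Suggested declaration: volatile sig_atomic_t is the only type the C
 * standard guarantees to be safely writable from a signal handler and
 * re-read on each access. */
static volatile sig_atomic_t exiting;

static void sig_handler(int sig)
{
	(void)sig;
	exiting = 1;
}

/* Simulates the sample's main loop: install the handler, deliver a
 * signal, and confirm the loop observes the flag and terminates. */
static int run_loop(void)
{
	signal(SIGINT, sig_handler);
	raise(SIGINT);		/* simulate Ctrl+C */
	while (!exiting)
		;		/* with a plain bool, -O2 may hoist this read */
	return 0;
}
```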
> +
> +static void usage(const char *name)
> +{
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260508150055.680136-1-vernon2gm@gmail.com?part=4
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
` (3 preceding siblings ...)
2026-05-08 15:00 ` [PATCH v2 4/4] samples: bpf: add mthp_ext Vernon Yang
@ 2026-05-08 15:14 ` Lorenzo Stoakes
2026-05-08 16:05 ` Lorenzo Stoakes
2026-05-08 16:00 ` Pedro Falcato
5 siblings, 1 reply; 22+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 15:14 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang
Thanks for the series, but overall it's got to be no to this until THP and mTHP
are in more stable shape.
And this is an RFC, you're trying to make really fundamental changes here, it's
almost... rude to do that out of the blue non-RFC'd (unless you're a maintainer
perhaps).
Right now the THP code base is a total mess and mTHP support is not even
properly merged yet (khugepaged support outstanding).
BPF interfaces are permanent, we've tried the 'experimental' thing before, it
doesn't work and we'll not be able to yank it later.
I've said it before, but we really truly need to get THP into better shape
before we can tolerate large new changes, let alone a user-exported interface.
So can we defer this until we're in better shape, and then send that as an RFC
first please?
On Fri, May 08, 2026 at 11:00:51PM +0800, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Hi all,
>
> Background
> ==========
>
> As is well known, a system can simultaneously run multiple different
> scenarios. However, THP is not beneficial in every scenario — it is only
> most suitable for memory-intensive applications that are not sensitive
> to tail latency. For example, Redis, which is sensitive to tail latency,
> is not suitable for THP. But in practice, due to Redis issues, the
> entire THP functionality is often turned off, preventing other scenarios
> from benefiting from it.
>
> There are also some embedded scenarios (e.g. Android) that directly use
> 2MB THP, where the granularity is too large. Therefore, we introduced
> mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> still globally fix a single mTHP size and are unable to automatically
> select different mTHP sizes based on different scenarios.
>
> After testing, it was found that
>
> - When the system has a lot of free memory, it is normal for Redis to
> use mTHP. Performance degradation in Redis only occurs when the system
> is under high memory pressure.
> - Additionally, when a large number of small-memory processes use mTHP,
> memory waste is prone to occur, and performance degradation may also
> happen during fast memory allocation/release.
>
> Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> following issues.
>
> - It breaks the cgroup hierarchy property.
> - It adds new THP knobs, making the sysadmin's job more complex
>
> Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> following issues.
>
> - It didn't address the per-process mode issue.
> - For global mode, the prctl(PR_SET_THP_DISABLE) has already achieved
> the same objective, there is no need to add two mechanisms for the
> same purpose.
> - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
> faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> implementation.
> - Unclear ABI stability guarantees.
Not unclear, any BPF interface is permanent.
> - The test cases are too simplistic, lacking eBPF cases similar to real
> workloads such as sched_ext.
>
> If I missed something, please let me know. Thanks!
>
> Solution
> ========
>
> This series will solve all the problems mentioned above.
>
> 1. Using cgroup-bpf to customize mTHP size for different scenarios
> 2. Use a cgroup eBPF program to monitor all sub-cgroups. Sub-cgroups
> under the same parent-cgroup adopt the same eBPF program. Only multiple
> sibling-cgroups (where the parent-cgroup has no attached eBPF program)
> are supported to attach multiple different eBPF programs without
> breaking the hierarchy property of the cgroup.
> 3. Automatically select different mTHP sizes for different cgroups,
> let's focus on making them truly transparent.
I don't see how cgroup level control is transparent :) this overall seems like
THP control at cgroup level by the back door, and I thought the cgroup people
were adamantly against that.
Personally I think we should actually allow less 'transparent' THP but that's a
debatable subject obviously.
> 4. Design mthp_ext case to address real workload issues and further
> clear/stabilize the ABI.
>
> The main functions of the mthp_ext are as follows:
>
> - When sub-cgroup is under high memory pressure (default, full 100ms 1s),
> it will automatically fallback to using 4KB.
> - When the anon+shmem memory usage of sub-cgroup falls below the minimum
> memory (default 16MB), small-memory processes will automatically
> fallback to using 4KB.
> - Under normal conditions, when there is no memory pressure and the
> anon+shmem memory usage exceeds the minimum memory, all mTHP sizes
> shall be utilized by kernel.
> - Monitor the root-cgroup (/sys/fs/cgroup) directory by default, with
> support for specifying any cgroup directory.
This seems like something prescriptive rather than 'bpf lets you make a
decision' and cgroup-level THP behaviour changes? It seems really out of scope.
>
> Performance
> ===========
>
> The below is some performance test results, testing on x86_64 machine
> (AMD Ryzen9 9950X 16C32T, 32G memory, 8G zram).
>
> NOTE: The following always/never labels indicate setting all mTHP sizes
> to always/never. Detailed test script reference[4].
>
> redis results
> ~~~~~~~~~~~~~
>
> command: redis-benchmark --csv -r 3000000 -n 3000000 -d 1024 -c 16 -P 32 -t set
>
> When cgroup memory.high=max, no memory pressure, seems only noise level
> changes, mthp_ext no regression.
>
> | redis-noBGSAVE | always | never | always+mthp_ext |
> |----------------|-------------|----------------------|---------------------|
> | rps | 1431307.083 | 1224004.250 (-14.5%) | 1420053.873 (-0.8%) |
> | avg_latency_ms | 0.216 | 0.256 (-18.5%) | 0.218 (-0.9%) |
> | p95_latency_ms | 0.612 | 0.708 (-15.7%) | 0.615 (-0.5%) |
> | p99_latency_ms | 0.682 | 0.812 (-19.1%) | 0.692 (-1.5%) |
>
> | redis-BGSAVE | always | never | always+mthp_ext |
> |----------------|-------------|----------------------|--------------------|
> | rps | 1429093.707 | 1231569.587 (-13.8%) | 1431075.330 (0.1%) |
> | avg_latency_ms | 0.216 | 0.255 (-18.1%) | 0.216 (0.0%) |
> | p95_latency_ms | 0.618 | 0.706 (-14.2%) | 0.615 (0.5%) |
> | p99_latency_ms | 0.684 | 0.823 (-20.3%) | 0.684 (0.0%) |
>
> When cgroup memory.high=2G, high memory pressure, mthp_ext RPS improve by
> 3450%, while significantly reducing the tail latency by 99%.
>
> | redis-noBGSAVE | always | never | always+mthp_ext |
> |----------------|-----------|----------------------|----------------------|
> | rps | 24932.790 | 976610.893 (3817.0%) | 885337.250 (3450.9%) |
> | avg_latency_ms | 13.173 | 0.326 (97.5%) | 0.367 (97.2%) |
> | p95_latency_ms | 23.028 | 0.786 (96.6%) | 1.511 (93.4%) |
> | p99_latency_ms | 366.762 | 1.183 (99.7%) | 2.975 (99.2%) |
>
> | redis-BGSAVE | always | never | always+mthp_ext |
> |----------------|-----------|-----------------------|----------------------|
> | rps | 50551.567 | 1026720.293 (1931.0%) | 892643.707 (1665.8%) |
> | avg_latency_ms | 6.581 | 0.310 (95.3%) | 0.365 (94.5%) |
> | p95_latency_ms | 16.730 | 0.772 (95.4%) | 1.447 (91.4%) |
> | p99_latency_ms | 311.551 | 1.140 (99.6%) | 2.988 (99.0%) |
>
> unixbench results
> ~~~~~~~~~~~~~~~~~
>
> command: ./Run -c 1 shell8
>
> mthp_ext improved by 5.99%.
>
> | unixbench shell8 | always | never | always+mthp_ext |
> |------------------|---------|-----------------|-----------------|
> | Score | 22916.8 | 24304.0 (6.05%) | 24289.9 (5.99%) |
>
> kernbench results
> ~~~~~~~~~~~~~~~~~
>
> When cgroup memory.high=max, no memory pressure, seems only noise level
> changes, mthp_ext no regression.
>
> always never always+mthp_ext
> Amean user-32 19702.39 ( 0.00%) 18428.90 * 6.46%* 19706.73 ( -0.02%)
> Amean syst-32 1159.55 ( 0.00%) 2252.43 * -94.25%* 1177.48 * -1.55%*
> Amean elsp-32 703.28 ( 0.00%) 699.10 * 0.59%* 703.99 * -0.10%*
> BAmean-95 user-32 19701.79 ( 0.00%) 18425.01 ( 6.48%) 19704.78 ( -0.02%)
> BAmean-95 syst-32 1159.43 ( 0.00%) 2251.86 ( -94.22%) 1177.03 ( -1.52%)
> BAmean-95 elsp-32 703.24 ( 0.00%) 698.99 ( 0.61%) 703.88 ( -0.09%)
> BAmean-99 user-32 19701.79 ( 0.00%) 18425.01 ( 6.48%) 19704.78 ( -0.02%)
> BAmean-99 syst-32 1159.43 ( 0.00%) 2251.86 ( -94.22%) 1177.03 ( -1.52%)
> BAmean-99 elsp-32 703.24 ( 0.00%) 698.99 ( 0.61%) 703.88 ( -0.09%)
>
> When cgroup memory.high=2G, high memory pressure, mthp_ext improved by 26%.
>
> always never always+mthp_ext
> Amean user-32 20250.65 ( 0.00%) 18368.91 * 9.29%* 18681.27 * 7.75%*
> Amean syst-32 12778.56 ( 0.00%) 9636.99 * 24.58%* 9392.65 * 26.50%*
> Amean elsp-32 1377.55 ( 0.00%) 1026.10 * 25.51%* 1019.40 * 26.00%*
> BAmean-95 user-32 20233.75 ( 0.00%) 18353.57 ( 9.29%) 18678.01 ( 7.69%)
> BAmean-95 syst-32 12543.21 ( 0.00%) 9612.28 ( 23.37%) 9386.83 ( 25.16%)
> BAmean-95 elsp-32 1367.82 ( 0.00%) 1023.75 ( 25.15%) 1018.17 ( 25.56%)
> BAmean-99 user-32 20233.75 ( 0.00%) 18353.57 ( 9.29%) 18678.01 ( 7.69%)
> BAmean-99 syst-32 12543.21 ( 0.00%) 9612.28 ( 23.37%) 9386.83 ( 25.16%)
> BAmean-99 elsp-32 1367.82 ( 0.00%) 1023.75 ( 25.15%) 1018.17 ( 25.56%)
>
> TODO
> ====
>
> - mthp_ext handles different "enum tva_type" values. For example, for
> small-memory processes, only 4KB is used in TVA_PAGEFAULT, while
> TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continues to collapse all mthp
> size. Under high memory pressure, only 4KB is used for
> TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues to
> collapse all mthp size.
> - selftest
>
> If there are additional scenarios, please let me know as well, so I can
> conduct further prototype verification tests to make mTHP more
> transparent and further clear/stabilize the BPF-THP ABI.
>
> If any of the above the strategies can be integrated into the kernel,
> please let me know. I would be delighted to incorporate these strategies
> into the kernel.
>
> This series is based on mm-new + "mm: BPF OOM"[3] first four patches.
Again, this really should have been an RFC, a 'TODO' section shouldn't exist in
a non-RFC series.
>
> Thank you very much for your comments and discussions.
>
> [1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com
> [2] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com
> [3] https://lore.kernel.org/linux-mm/20260127024421.494929-1-roman.gushchin@linux.dev
> [4] https://github.com/vernon2gh/app_and_module/tree/main/mthp_ext
>
> V1 -> V2:
> - Rebase on mm-new, run all performance tests again.
> - Register eBPF programs only when no mthp_ops exists in any sub-cgroup, so
> the cgroup hierarchy property is not broken.
> - Fix newly created cgroups silently bypass the hierarchical BPF mTHP policy.
> - Fix bpf_mthp_choose() UAF due to improper SRCU locking.
> - Add bounds check in bpf_cgroup_stall() and fix return type to u64.
> - Check cgroup_psi() return value.
> - Fix spurious mTHP fallback during initial cgroup scan due to zero-init
> info->stall.
> - Fix info->order being set to 0 when no processes are running in the cgroup.
> - Fix compilation failure when CONFIG_CGROUPS=y && CONFIG_PSI=n.
> - Fix NULL pointer dereference of st_link.
> - Fix infinite loop in trigger_scan() when read() returns an error.
> - Fix integer overflow in FROM_MB() macro.
> - Fix setup_psi_trigger() failure masking the error code.
>
> V1 : https://lore.kernel.org/linux-mm/20260503165024.1526680-1-vernon2gm@gmail.com/
All well and good, but I don't see any actual review there, another reason to
send this kind of thing as an RFC first please :)
>
> Vernon Yang (4):
> psi: add psi_group_flush_stats() function
> bpf: add bpf_cgroup_{flush_stats,stall} function
> mm: introduce bpf_mthp_ops struct ops
> samples: bpf: add mthp_ext
>
> MAINTAINERS | 3 +
> include/linux/bpf_huge_memory.h | 52 +++++
> include/linux/cgroup-defs.h | 1 +
> include/linux/huge_mm.h | 6 +
> include/linux/psi.h | 5 +
> kernel/bpf/helpers.c | 34 ++++
> kernel/cgroup/cgroup.c | 2 +
> kernel/sched/psi.c | 34 +++-
> mm/Kconfig | 14 ++
> mm/Makefile | 1 +
> mm/bpf_huge_memory.c | 168 ++++++++++++++++
> samples/bpf/.gitignore | 1 +
> samples/bpf/Makefile | 7 +-
> samples/bpf/mthp_ext.bpf.c | 148 ++++++++++++++
> samples/bpf/mthp_ext.c | 339 ++++++++++++++++++++++++++++++++
> samples/bpf/mthp_ext.h | 30 +++
> 16 files changed, 836 insertions(+), 9 deletions(-)
> create mode 100644 include/linux/bpf_huge_memory.h
> create mode 100644 mm/bpf_huge_memory.c
> create mode 100644 samples/bpf/mthp_ext.bpf.c
> create mode 100644 samples/bpf/mthp_ext.c
> create mode 100644 samples/bpf/mthp_ext.h
>
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 22+ messages in thread

* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
2026-05-08 15:14 ` [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Lorenzo Stoakes
@ 2026-05-08 16:05 ` Lorenzo Stoakes
2026-05-08 16:53 ` Vernon Yang
0 siblings, 1 reply; 22+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 16:05 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang
On Fri, May 08, 2026 at 04:15:04PM +0100, Lorenzo Stoakes wrote:
> Thanks for the series, but overall it's got to be no to this until THP and mTHP
> are in more stable shape.
>
> And this is an RFC, you're trying to make really fundamental changes here, it's
> almost... rude to do that out of the blue non-RFC'd (unless you're a maintainer
> perhaps).
>
> Right now the THP code base is a total mess and mTHP support is not even
> properly merged yet (khugepaged support outstanding).
>
> BPF interfaces are permanent, we've tried the 'experimental' thing before, it
> doesn't work and we'll not be able to yank it later.
>
> I've said it before, but we really truly need to get THP into better shape
> before we can tolerate large new changes, let alone a user-exported interface.
>
> So can we defer this until we're in better shape, and then send that as an RFC
> first please?
Yeah on second thoughts, NACK and don't send this series again please.
I was already annoyed you'd send something this invasive and massive without an
RFC, but you've also ignored the feedback we gave to the last THP BPF series
while ostensibly claiming to have taken it into account.
And then... I mean seriously... _shamelessly_ trying to take control away from
THP maintainers and reviewers who work bloody hard for this community by parking
code that changes mTHP behaviour in an entirely distinct and unrelated
MAINTAINERS section...!
There's a biweekly THP cabal meeting which you didn't raise this in, you didn't
bring this up at any conference, you didn't send an RFC.
You've sent this, too, before we even have mTHP khugepaged support merged... or
have really settled on how mTHP is supposed to work overall.
And also I have made it really abundantly clear that I want to see the technical
debt _paid down_ before we add anything else major.
And as if that wasn't enough, AI review is finding endless problems with this
series on top of all that.
This is NOT how to engage with upstream. Again, please don't send any more
revisions of this.
And next time _engage with the community_ before proposing something this big. A
[DISCUSSION] email, or an RFC, or in a meeting or at a conference, or even
off-list or on-list mail, something.
Lorenzo
* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
2026-05-08 16:05 ` Lorenzo Stoakes
@ 2026-05-08 16:53 ` Vernon Yang
2026-05-11 11:20 ` Lorenzo Stoakes
0 siblings, 1 reply; 22+ messages in thread
From: Vernon Yang @ 2026-05-08 16:53 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang
On Sat, May 9, 2026 at 12:05 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Fri, May 08, 2026 at 04:15:04PM +0100, Lorenzo Stoakes wrote:
> > <snip>
>
> Yeah on second thoughts, NACK and don't send this series again please.
>
> I was already annoyed you'd send something this invasive and massive without an
> RFC, but you've also ignored the feedback we gave to the last THP BPF series
> while ostensibly claiming to have taken it into account.
>
> And then... I mean seriously... _shamelessly_ trying to take control away from
> THP maintainers and reviewers who work bloody hard for this community by parking
> code that changes mTHP behaviour in an entirely distinct and unrelated
> MAINTAINERS section...!
>
> There's a biweekly THP cabal meeting which you didn't raise this in, you didn't
> bring this up at any conference, you didn't send an RFC.
>
> You've sent this, too, before we even have mTHP khugepaged support merged... or
> have really settled on how mTHP is supposed to work overall.
>
> And also I have made it really abundantly clear that I want to see the technical
> debt _paid down_ before we add anything else major.
>
> And as if that wasn't enough, AI review is finding endless problems with this
> series on top of all that.
>
> This is NOT how to engage with upstream. Again, please don't send any more
> revisions of this.
>
> And next time _engage with the community_ before proposing something this big. A
> [DISCUSSION] email, or an RFC, or in a meeting or at a conference, or even
> off-list or on-list mail, something.
First, I will not submit any new version until mTHP has stabilized and
is in better shape.
Let me clarify a few issues:
1. This was meant to be an RFC; I forgot to add the tag. Sorry.
2. There is only one issue in the AI review; the rest are false
positives (the AI did not find the dependent patch "mm: BPF OOM").
3. Regarding placing bpf_huge_memory.c under "MEMORY MANAGEMENT
EXTENSIONS": I never intended to take control of THP away from the
maintainers and reviewers. However, it is still my fault for causing
the misunderstanding. Sorry.
Also, I would like to ask: what mTHP work still needs further
refinement at present? I can help out.
* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
2026-05-08 16:53 ` Vernon Yang
@ 2026-05-11 11:20 ` Lorenzo Stoakes
0 siblings, 0 replies; 22+ messages in thread
From: Lorenzo Stoakes @ 2026-05-11 11:20 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang
On Sat, May 09, 2026 at 12:53:35AM +0800, Vernon Yang wrote:
> On Sat, May 9, 2026 at 12:05 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
> >
> > <snip>
>
> First, I will not submit any new version until mTHP has stabilized and
> is in better shape.
>
> Let me clarify a few issues:
> 1. This was meant to be an RFC; I forgot to add the tag. Sorry.
> 2. There is only one issue in the AI review; the rest are false
> positives (the AI did not find the dependent patch "mm: BPF OOM").
> 3. Regarding placing bpf_huge_memory.c under "MEMORY MANAGEMENT
> EXTENSIONS": I never intended to take control of THP away from the
> maintainers and reviewers. However, it is still my fault for causing
> the misunderstanding. Sorry.
>
> Also, I would like to ask: what mTHP work still needs further
> refinement at present? I can help out.
Sorry maybe I overreacted here, long week...!
But in general - yes there's work to be done but what we need help with
above everything else is to pay down technical debt in the THP codebase.
Review also helps :) Right now we are adding mTHP support to khugepaged,
which is the next 'big thing' for mTHP.
As David has said elsewhere, the _interface_ is the challenge with
BPF. Because we truly want to be sure that interface is the right one and
won't impact our ability to make changes to the implementation of THP as a
whole.
Treating BPF as a de facto permanent uAPI is the way to go I think in
general.
Cheers, Lorenzo
* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
2026-05-08 15:00 [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
` (4 preceding siblings ...)
2026-05-08 15:14 ` [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Lorenzo Stoakes
@ 2026-05-08 16:00 ` Pedro Falcato
2026-05-08 16:15 ` Lorenzo Stoakes
5 siblings, 1 reply; 22+ messages in thread
From: Pedro Falcato @ 2026-05-08 16:00 UTC (permalink / raw)
To: Vernon Yang
Cc: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb, tz2294, baohua, lance.yang, dev.jain, laoar.shao,
gutierrez.asier, linux-kernel, linux-mm, bpf, Vernon Yang
On Fri, May 08, 2026 at 11:00:51PM +0800, Vernon Yang wrote:
> From: Vernon Yang <yanglincheng@kylinos.cn>
>
> Hi all,
>
> Background
> ==========
>
> As is well known, a system can simultaneously run multiple different
> scenarios. However, THP is not beneficial in every scenario — it is
> mainly suited to memory-intensive applications that are not sensitive
> to tail latency. For example, Redis, which is sensitive to tail latency,
> is not suitable for THP. But in practice, due to Redis issues, the
> entire THP functionality is often turned off, preventing other scenarios
> from benefiting from it.
>
> There are also some embedded scenarios (e.g. Android) that directly use
> 2MB THP, where the granularity is too large. Therefore, we introduced
> mTHP in v6.8, which supports multiple-size THP. In practice, however, we
> still globally fix a single mTHP size and are unable to automatically
> select different mTHP sizes based on different scenarios.
>
> After testing, it was found that
>
> - When the system has plenty of free memory, Redis runs fine with mTHP.
> Performance degradation in Redis occurs only when the system is under
> high memory pressure.
> - Additionally, when many small-memory processes use mTHP, memory is
> easily wasted, and performance may also degrade during rapid memory
> allocation/release.
>
> Previously, "Cgroup-based THP control"[1] was proposed, but it had the
> following issues.
>
> - It breaks the cgroup hierarchy property.
> - It adds new THP knobs, making the sysadmin's job more complex.
>
> Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
> following issues.
>
> - It didn't address the issue in per-process mode.
> - For global mode, prctl(PR_SET_THP_DISABLE) has already achieved the
> same objective; there is no need to add two mechanisms for the same
> purpose.
> - Attaching st_ops to mm_struct, the same issues that cgroup-bpf once
> faced are likely to arise again, e.g. lifetime of cgroup vs bpf, dying
> cgroups, wq deadlock, etc. It is recommended to use cgroup-bpf for
> implementation.
> - Unclear ABI stability guarantees.
> - The test cases are too simplistic, lacking eBPF cases similar to real
> workloads such as sched_ext.
>
> If I missed something, please let me know. Thanks!
>
<snip>
> kernbench results
> ~~~~~~~~~~~~~~~~~
>
> When cgroup memory.high=max (no memory pressure), only noise-level
> changes are seen; mthp_ext shows no regression.
>
> always never always+mthp_ext
> Amean user-32 19702.39 ( 0.00%) 18428.90 * 6.46%* 19706.73 ( -0.02%)
> Amean syst-32 1159.55 ( 0.00%) 2252.43 * -94.25%* 1177.48 * -1.55%*
> Amean elsp-32 703.28 ( 0.00%) 699.10 * 0.59%* 703.99 * -0.10%*
> BAmean-95 user-32 19701.79 ( 0.00%) 18425.01 ( 6.48%) 19704.78 ( -0.02%)
> BAmean-95 syst-32 1159.43 ( 0.00%) 2251.86 ( -94.22%) 1177.03 ( -1.52%)
> BAmean-95 elsp-32 703.24 ( 0.00%) 698.99 ( 0.61%) 703.88 ( -0.09%)
> BAmean-99 user-32 19701.79 ( 0.00%) 18425.01 ( 6.48%) 19704.78 ( -0.02%)
> BAmean-99 syst-32 1159.43 ( 0.00%) 2251.86 ( -94.22%) 1177.03 ( -1.52%)
> BAmean-99 elsp-32 703.24 ( 0.00%) 698.99 ( 0.61%) 703.88 ( -0.09%)
>
> When cgroup memory.high=2G (high memory pressure), mthp_ext improved
> performance by 26%.
>
> always never always+mthp_ext
> Amean user-32 20250.65 ( 0.00%) 18368.91 * 9.29%* 18681.27 * 7.75%*
> Amean syst-32 12778.56 ( 0.00%) 9636.99 * 24.58%* 9392.65 * 26.50%*
> Amean elsp-32 1377.55 ( 0.00%) 1026.10 * 25.51%* 1019.40 * 26.00%*
> BAmean-95 user-32 20233.75 ( 0.00%) 18353.57 ( 9.29%) 18678.01 ( 7.69%)
> BAmean-95 syst-32 12543.21 ( 0.00%) 9612.28 ( 23.37%) 9386.83 ( 25.16%)
> BAmean-95 elsp-32 1367.82 ( 0.00%) 1023.75 ( 25.15%) 1018.17 ( 25.56%)
> BAmean-99 user-32 20233.75 ( 0.00%) 18353.57 ( 9.29%) 18678.01 ( 7.69%)
> BAmean-99 syst-32 12543.21 ( 0.00%) 9612.28 ( 23.37%) 9386.83 ( 25.16%)
> BAmean-99 elsp-32 1367.82 ( 0.00%) 1023.75 ( 25.15%) 1018.17 ( 25.56%)
>
> TODO
> ====
>
> - Make mthp_ext handle different "enum tva_type" values. For example,
> for small-memory processes, only 4KB is used in TVA_PAGEFAULT, while
> TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continue to collapse all mTHP
> sizes. Under high memory pressure, only 4KB is used for
> TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues to
> collapse all mTHP sizes.
> - selftest
>
> If there are additional scenarios, please let me know as well, so I can
> run further prototype verification tests to make mTHP more transparent
> and further clarify/stabilize the BPF-THP ABI.
How is it more transparent if you're essentially adding mTHP
micro-programmability from the user's side? This series makes it
_less_ transparent.
If you actually want to make it more transparent, then I would suggest
improving the heuristics such that (m)THP doesn't churn through memory
on high memory pressure. Or such that it doesn't feel extremely compelled
to place the largest THP it can based on vibes.
--
Pedro
* Re: [PATCH v2 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
2026-05-08 16:00 ` Pedro Falcato
@ 2026-05-08 16:15 ` Lorenzo Stoakes
0 siblings, 0 replies; 22+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 16:15 UTC (permalink / raw)
To: Pedro Falcato
Cc: Vernon Yang, akpm, david, roman.gushchin, inwardvessel,
shakeel.butt, ast, daniel, surenb, tz2294, baohua, lance.yang,
dev.jain, laoar.shao, gutierrez.asier, linux-kernel, linux-mm,
bpf, Vernon Yang
On Fri, May 08, 2026 at 05:00:04PM +0100, Pedro Falcato wrote:
> On Fri, May 08, 2026 at 11:00:51PM +0800, Vernon Yang wrote:
> > <snip>
>
> How is it more transparent if you're essentially adding mTHP
> micro-programmability from the user's side? This series makes it
> _less_ transparent.
>
> If you actually want to make it more transparent, then I would suggest
> improving the heuristics such that (m)THP doesn't churn through memory
> on high memory pressure. Or such that it doesn't feel extremely compelled
> to place the largest THP it can based on vibes.
I agree but I also don't really want to see anything like that until mTHP is
actually stabilised and the code base is less appalling :)
We've deferred paying down technical debt far too long.
>
> --
> Pedro
Thanks, Lorenzo