* [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent
@ 2026-05-03 16:50 Vernon Yang
2026-05-03 16:50 ` [PATCH 1/4] psi: add psi_group_flush_stats() function Vernon Yang
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Vernon Yang @ 2026-05-03 16:50 UTC (permalink / raw)
To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast,
daniel, surenb
Cc: linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain,
Vernon Yang
From: Vernon Yang <yanglincheng@kylinos.cn>
Hi all,
Background
==========
As is well known, a system can run multiple different workloads
simultaneously. However, THP is not beneficial in every scenario; it is
best suited to memory-intensive applications that are not sensitive to
tail latency. Redis, for example, is sensitive to tail latency and
therefore a poor fit for THP. In practice, however, THP is often
disabled system-wide because of Redis, preventing other workloads from
benefiting from it.
There are also embedded scenarios (e.g. Android) that directly use 2MB
THP, where the granularity is too large. Therefore mTHP, which supports
multiple THP sizes, was introduced in v6.8. In practice, however, we
still globally fix a single mTHP size and cannot automatically select
different mTHP sizes for different scenarios.
Testing showed that:
- When the system has plenty of free memory, it is fine for Redis to
use mTHP. Performance degradation in Redis only occurs when the system
is under high memory pressure.
- When a large number of small-memory processes use mTHP, memory waste
is likely, and performance may also degrade during fast memory
allocation/release cycles.
Previously, "Cgroup-based THP control"[1] was proposed, but it had the
following issues:
- It breaks the cgroup hierarchy property.
- It adds new THP knobs, making the sysadmin's job more complex.
Previously, "mm, bpf: BPF-MM, BPF-THP"[2] was proposed, but it had the
following issues:
- It did not address the per-process mode.
- For global mode, prctl(PR_SET_THP_DISABLE) already achieves the same
objective; there is no need to add two mechanisms for the same
purpose.
- Attaching st_ops to mm_struct is likely to reintroduce the same
issues that cgroup-bpf once faced, e.g. the lifetime of cgroup vs bpf,
dying cgroups, wq deadlocks, etc. It is recommended to use cgroup-bpf
for the implementation.
- The test cases are too simplistic, lacking eBPF cases similar to
real workloads, as sched_ext has.
If I missed something, please let me know. Thanks!
Solution
========
This series solves all the problems mentioned above:
1. Use cgroup-bpf to customize the mTHP size for different scenarios.
2. Use a cgroup eBPF program to monitor all sub-cgroups. Sub-cgroups
under the same parent cgroup adopt the same eBPF program. Only sibling
cgroups (whose parent cgroup has no attached eBPF program) may attach
different eBPF programs, so the cgroup hierarchy property is not
broken.
3. Automatically select different mTHP sizes for different cgroups;
let's focus on making them truly transparent.
4. Design the mthp_ext sample to address real workload issues.
The main functions of mthp_ext are as follows:
- When a sub-cgroup is under high memory pressure (by default, 100ms of
full stall within 1s), it automatically falls back to 4KB pages.
- When the anon+shmem memory usage of a sub-cgroup falls below the
minimum (default 16MB), small-memory processes automatically fall
back to 4KB pages.
- Under normal conditions, when there is no memory pressure and the
anon+shmem memory usage exceeds the minimum, the kernel may use all
mTHP sizes.
- The root cgroup (/sys/fs/cgroup) is monitored by default, with
support for specifying any cgroup directory.
Performance
===========
Below are some performance test results, measured on an x86_64 machine
(AMD Ryzen 9 9950X 16C/32T, 32G memory, 8G zram).
NOTE: The always/never labels below mean setting all mTHP sizes to
always/never. See [4] for the detailed test scripts.
redis results
~~~~~~~~~~~~~
command: redis-benchmark --csv -r 3000000 -n 3000000 -d 1024 -c 16 -P 32 -t set
When cgroup memory.high=max.
| redis-noBGSAVE | always | never | always+mthp_ext |
|----------------|-------------|----------------------|----------------------|
| rps | 1410824.167 | 1210387.500 (-14.2%) | 1265659.833 (-10.3%) |
| avg_latency_ms | 0.220 | 0.259 (-17.7%) | 0.247 (-12.3%) |
| p95_latency_ms | 0.618 | 0.708 (-14.6%) | 0.676 (-9.40%) |
| p99_latency_ms | 0.687 | 0.818 (-19.1%) | 0.756 (-10.0%) |
| redis-BGSAVE | always | never | always+mthp_ext |
|----------------|-------------|----------------------|----------------------|
| rps | 1418032.127 | 1212306.873 (-14.5%) | 1261069.373 (-11.1%) |
| avg_latency_ms | 0.218 | 0.259 (-18.8%) | 0.248 (-13.8%) |
| p95_latency_ms | 0.620 | 0.714 (-15.2%) | 0.687 (-10.8%) |
| p99_latency_ms | 0.684 | 0.828 (-21.1%) | 0.756 (-10.5%) |
When cgroup memory.high=2G.
| redis-noBGSAVE | always | never | always+mthp_ext |
|----------------|-----------|-----------------------|-----------------------|
| rps | 24813.980 | 1049254.583 (4128.5%) | 1063171.270 (4184.6%) |
| avg_latency_ms | 13.317 | 0.302 ( 97.7%) | 0.298 ( 97.8%) |
| p95_latency_ms | 23.220 | 0.754 ( 96.8%) | 0.828 ( 96.4%) |
| p99_latency_ms | 369.492 | 1.154 ( 99.7%) | 1.615 ( 99.6%) |
| redis-BGSAVE | always | never | always+mthp_ext |
|----------------|-----------|-----------------------|-----------------------|
| rps | 48373.433 | 1058403.500 (2088.0%) | 1070805.707 (2113.6%) |
| avg_latency_ms | 6.884 | 0.300 ( 95.6%) | 0.296 ( 95.7%) |
| p95_latency_ms | 16.474 | 0.743 ( 95.5%) | 0.820 ( 95.0%) |
| p99_latency_ms | 326.058 | 1.170 ( 99.6%) | 1.586 ( 99.5%) |
When Redis is under no memory pressure, RPS drops by 10.3% (from 1.4M
to 1.2M; is this within the acceptable range?).
Under high memory pressure, however, RPS improves by 4184.6% (from 24K
to 1M) while tail latency is reduced by 99%.
unixbench results
~~~~~~~~~~~~~~~~~
command: ./Run -c 1 shell8
| unixbench shell8 | always | never | always+mthp_ext |
|------------------|---------|-----------------|-----------------|
| Score | 23019.4 | 24378.3 (5.90%) | 24314.5 (5.63%) |
mthp_ext improves the score by 5.63%.
kernbench results
~~~~~~~~~~~~~~~~~
When cgroup memory.high=max, mthp_ext shows no regression.
always never always+mthp_ext
Amean user-32 19666.44 ( 0.00%) 18464.56 * 6.11%* 19650.13 * 0.08%*
Amean syst-32 1169.16 ( 0.00%) 2235.17 * -91.18%* 1169.42 ( -0.02%)
Amean elsp-32 702.51 ( 0.00%) 699.90 * 0.37%* 702.15 ( 0.05%)
BAmean-95 user-32 19665.93 ( 0.00%) 18461.86 ( 6.12%) 19647.61 ( 0.09%)
BAmean-95 syst-32 1168.68 ( 0.00%) 2234.27 ( -91.18%) 1169.20 ( -0.04%)
BAmean-95 elsp-32 702.34 ( 0.00%) 699.80 ( 0.36%) 702.04 ( 0.04%)
BAmean-99 user-32 19665.93 ( 0.00%) 18461.86 ( 6.12%) 19647.61 ( 0.09%)
BAmean-99 syst-32 1168.68 ( 0.00%) 2234.27 ( -91.18%) 1169.20 ( -0.04%)
BAmean-99 elsp-32 702.34 ( 0.00%) 699.80 ( 0.36%) 702.04 ( 0.04%)
When cgroup memory.high=2G, mthp_ext improves system time by 20.98%.
always never always+mthp_ext
Amean user-32 20459.89 ( 0.00%) 18517.24 * 9.49%* 19963.73 * 2.43%*
Amean syst-32 11890.63 ( 0.00%) 6681.95 * 43.80%* 9395.94 * 20.98%*
Amean elsp-32 1305.29 ( 0.00%) 928.13 * 28.89%* 1109.37 * 15.01%*
BAmean-95 user-32 20439.38 ( 0.00%) 18510.65 ( 9.44%) 19957.89 ( 2.36%)
BAmean-95 syst-32 11789.99 ( 0.00%) 6679.03 ( 43.35%) 9381.77 ( 20.43%)
BAmean-95 elsp-32 1302.18 ( 0.00%) 927.89 ( 28.74%) 1108.65 ( 14.86%)
BAmean-99 user-32 20439.38 ( 0.00%) 18510.65 ( 9.44%) 19957.89 ( 2.36%)
BAmean-99 syst-32 11789.99 ( 0.00%) 6679.03 ( 43.35%) 9381.77 ( 20.43%)
BAmean-99 elsp-32 1302.18 ( 0.00%) 927.89 ( 28.74%) 1108.65 ( 14.86%)
TODO
====
- Do not break the cgroup hierarchy property: if an eBPF program
already exists in a sub-cgroup, raise an error and clear the
bpf_mthp_ops data that has already been set.
- Make mthp_ext handle the different "enum tva_type" values. For
example, for small-memory processes, only 4KB is used for
TVA_PAGEFAULT, while TVA_KHUGEPAGED/TVA_FORCED_COLLAPSE continues to
collapse all mTHP sizes. Under high memory pressure, only 4KB is used
for TVA_PAGEFAULT/TVA_KHUGEPAGED, while TVA_FORCED_COLLAPSE continues
to collapse all mTHP sizes.
- Add selftests.
If there are additional scenarios, please let me know as well, so I can
run further prototype verification tests to make mTHP more transparent.
If any of the above strategies can be integrated into the kernel,
please let me know; I would be delighted to incorporate them.
This series is based on linux v7.1-rc1 (26fd6bff2c05) plus the first
four patches of "mm: BPF OOM"[3].
Thank you very much for your comments and discussions.
[1] https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com
[2] https://lore.kernel.org/linux-mm/20251026100159.6103-1-laoar.shao@gmail.com
[3] https://lore.kernel.org/linux-mm/20260127024421.494929-1-roman.gushchin@linux.dev
[4] https://github.com/vernon2gh/app_and_module/tree/main/mthp_ext
Vernon Yang (4):
psi: add psi_group_flush_stats() function
bpf: add bpf_cgroup_{flush_stats,stall} function
mm: introduce bpf_mthp_ops struct ops
samples: bpf: add mthp_ext
MAINTAINERS | 3 +
include/linux/bpf_huge_memory.h | 35 ++++
include/linux/cgroup-defs.h | 1 +
include/linux/huge_mm.h | 6 +
include/linux/psi.h | 1 +
kernel/bpf/helpers.c | 29 +++
kernel/sched/psi.c | 34 +++-
mm/Kconfig | 14 ++
mm/Makefile | 1 +
mm/bpf_huge_memory.c | 169 ++++++++++++++++
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 7 +-
samples/bpf/mthp_ext.bpf.c | 142 +++++++++++++
samples/bpf/mthp_ext.c | 340 ++++++++++++++++++++++++++++++++
samples/bpf/mthp_ext.h | 30 +++
15 files changed, 804 insertions(+), 9 deletions(-)
create mode 100644 include/linux/bpf_huge_memory.h
create mode 100644 mm/bpf_huge_memory.c
create mode 100644 samples/bpf/mthp_ext.bpf.c
create mode 100644 samples/bpf/mthp_ext.c
create mode 100644 samples/bpf/mthp_ext.h
--
2.53.0
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH 1/4] psi: add psi_group_flush_stats() function 2026-05-03 16:50 [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang @ 2026-05-03 16:50 ` Vernon Yang 2026-05-03 16:50 ` [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang ` (2 subsequent siblings) 3 siblings, 0 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-03 16:50 UTC (permalink / raw) To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb Cc: linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, Vernon Yang From: Vernon Yang <yanglincheng@kylinos.cn> Add psi_group_flush_stats() function to prepare for the subsequent mthp_ext ebpf program. no function changes. Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> --- include/linux/psi.h | 1 + kernel/sched/psi.c | 34 ++++++++++++++++++++++++++-------- 2 files changed, 27 insertions(+), 8 deletions(-) diff --git a/include/linux/psi.h b/include/linux/psi.h index e0745873e3f2..7b4fd8190810 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -22,6 +22,7 @@ void psi_init(void); void psi_memstall_enter(unsigned long *flags); void psi_memstall_leave(unsigned long *flags); +void psi_group_flush_stats(struct psi_group *group); int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res); struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf, enum psi_res res, struct file *file, diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index d9c9d9480a45..76ffad90b0b5 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -1242,11 +1242,35 @@ void psi_cgroup_restart(struct psi_group *group) } #endif /* CONFIG_CGROUPS */ +/* + * __psi_group_flush_stats - flush the total stall time of a psi group + * @group: psi group to flush + */ +static void __psi_group_flush_stats(struct psi_group *group) +{ + u64 now; + + /* Update averages before reporting them */ + 
mutex_lock(&group->avgs_lock); + now = sched_clock(); + collect_percpu_times(group, PSI_AVGS, NULL); + if (now >= group->avg_next_update) + group->avg_next_update = update_averages(group, now); + mutex_unlock(&group->avgs_lock); +} + +void psi_group_flush_stats(struct psi_group *group) +{ + if (static_branch_likely(&psi_disabled)) + return; + + __psi_group_flush_stats(group); +} + int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) { bool only_full = false; int full; - u64 now; if (static_branch_likely(&psi_disabled)) return -EOPNOTSUPP; @@ -1256,13 +1280,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res) return -EOPNOTSUPP; #endif - /* Update averages before reporting them */ - mutex_lock(&group->avgs_lock); - now = sched_clock(); - collect_percpu_times(group, PSI_AVGS, NULL); - if (now >= group->avg_next_update) - group->avg_next_update = update_averages(group, now); - mutex_unlock(&group->avgs_lock); + __psi_group_flush_stats(group); #ifdef CONFIG_IRQ_TIME_ACCOUNTING only_full = res == PSI_IRQ; -- 2.53.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function 2026-05-03 16:50 [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang 2026-05-03 16:50 ` [PATCH 1/4] psi: add psi_group_flush_stats() function Vernon Yang @ 2026-05-03 16:50 ` Vernon Yang 2026-05-03 17:23 ` bot+bpf-ci [not found] ` <20260503172520.376AAC2BCB4@smtp.kernel.org> 2026-05-03 16:50 ` [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang 2026-05-03 16:50 ` [PATCH 4/4] samples: bpf: add mthp_ext Vernon Yang 3 siblings, 2 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-03 16:50 UTC (permalink / raw) To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb Cc: linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, Vernon Yang From: Vernon Yang <yanglincheng@kylinos.cn> Add bpf_cgroup_{flush_stats,stall} function to prepare for the subsequent mthp_ext ebpf program. no function changes. Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> --- kernel/bpf/helpers.c | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 2bb60200c266..87f3072adce3 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -29,6 +29,7 @@ #include <linux/task_work.h> #include <linux/irq_work.h> #include <linux/buildid.h> +#include <linux/psi.h> #include "../../lib/kstrtox.h" @@ -2819,6 +2820,32 @@ __bpf_kfunc struct cgroup *bpf_cgroup_from_id(u64 cgid) return cgrp; } +/** + * bpf_cgroup_stall - acquire the total stall time of cgroup + * @cgrp: cgroup struct + * @states: psi states + * + * Return the total stall time. 
+ */ +__bpf_kfunc unsigned long bpf_cgroup_stall(struct cgroup *cgrp, + enum psi_states states) +{ + struct psi_group *group = cgroup_psi(cgrp); + + return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC); +} + +/** + * bpf_cgroup_flush_stats - Flush cgroup's statistics + * @cgrp: cgroup struct + */ +__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp) +{ + struct psi_group *group = cgroup_psi(cgrp); + + psi_group_flush_stats(group); +} + /** * bpf_task_under_cgroup - wrap task_under_cgroup_hierarchy() as a kfunc, test * task's membership of cgroup ancestry. @@ -4732,6 +4759,8 @@ BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_cgroup_release, KF_RELEASE) BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_cgroup_stall) +BTF_ID_FLAGS(func, bpf_cgroup_flush_stats, KF_SLEEPABLE) BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU) BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL) #endif -- 2.53.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function 2026-05-03 16:50 ` [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang @ 2026-05-03 17:23 ` bot+bpf-ci 2026-05-06 12:38 ` Vernon Yang [not found] ` <20260503172520.376AAC2BCB4@smtp.kernel.org> 1 sibling, 1 reply; 14+ messages in thread From: bot+bpf-ci @ 2026-05-03 17:23 UTC (permalink / raw) To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb Cc: linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, yanglincheng, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 3397 bytes --] > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index baa12b24bb64..d7b04f1fc9f1 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c [ ... ] > @@ -2819,6 +2820,32 @@ __bpf_kfunc struct cgroup *bpf_cgroup_from_id(u64 cgid) > return cgrp; > } > > +/** > + * bpf_cgroup_stall - acquire the total stall time of cgroup > + * @cgrp: cgroup struct > + * @states: psi states > + * > + * Return the total stall time. > + */ > +__bpf_kfunc unsigned long bpf_cgroup_stall(struct cgroup *cgrp, > + enum psi_states states) > +{ > + struct psi_group *group = cgroup_psi(cgrp); > + > + return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC); > +} Can cgroup_psi() return NULL here? Looking at include/linux/psi.h, cgroup_psi() returns cgrp->psi for non-root cgroups, which is allocated in psi_cgroup_alloc() only when the psi_cgroups_enabled static branch is true. When the kernel boots with cgroup_disable=pressure or psi=0, cgrp->psi remains NULL for all non-root cgroups. The sysfs *.pressure files are gated by cgroup_psi_enabled() so they never hit this case, but these new kfuncs are directly callable from BPF programs without such gating. A BPF program could obtain a valid non-root cgroup (e.g., via bpf_cgroup_from_id) and call bpf_cgroup_stall on it. 
If PSI is disabled for cgroups, this would dereference NULL at group->total[...]. Similarly for bpf_cgroup_flush_stats below, passing NULL to psi_group_flush_stats would eventually reach __psi_group_flush_stats which does mutex_lock(&NULL->avgs_lock). Should these functions check for NULL after calling cgroup_psi()? A subsystem pattern also flags this as potentially concerning: the `states` parameter is used directly as an array index without bounds checking. The backing array is declared as u64 total[NR_PSI_AGGREGATORS][NR_PSI_STATES - 1] (see include/linux/psi_types.h, struct psi_group), while enum psi_states has the range 0..NR_PSI_STATES-1 (PSI_NONIDLE is the last member). The BPF verifier only checks that the register is SCALAR_VALUE and does not validate enum ranges. Because the parameter is not marked __const, a BPF program can pass any scalar, including values derived from map lookups or arithmetic. Passing PSI_NONIDLE (a valid enum value) reads one element past the inner array; passing a negative value (legal because enum is signed int in C) reads memory before the array. This would leak adjacent memory contents to the BPF program. Existing BPF cgroup-stat kfuncs guard against this pattern: bpf_mem_cgroup_page_state (mm/bpf_memcontrol.c): if (unlikely(!memcg_stat_item_valid(idx))) return (unsigned long)-1; bpf_mem_cgroup_memory_events (mm/bpf_memcontrol.c): if (unlikely(event >= MEMCG_NR_MEMORY_EVENTS)) return (unsigned long)-1; Should this function add a similar check, such as: if ((unsigned int)states >= NR_PSI_STATES - 1) return -1; > + > +/** > + * bpf_cgroup_flush_stats - Flush cgroup's statistics > + * @cgrp: cgroup struct > + */ > +__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp) > +{ > + struct psi_group *group = cgroup_psi(cgrp); > + > + psi_group_flush_stats(group); > +} [ ... ] --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. 
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25285279315 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function 2026-05-03 17:23 ` bot+bpf-ci @ 2026-05-06 12:38 ` Vernon Yang 0 siblings, 0 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-06 12:38 UTC (permalink / raw) To: bot+bpf-ci Cc: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb, linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, yanglincheng, andrii, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai On Mon, May 4, 2026 at 1:23 AM <bot+bpf-ci@kernel.org> wrote: > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > index baa12b24bb64..d7b04f1fc9f1 100644 > > --- a/kernel/bpf/helpers.c > > +++ b/kernel/bpf/helpers.c > > [ ... ] > > > @@ -2819,6 +2820,32 @@ __bpf_kfunc struct cgroup *bpf_cgroup_from_id(u64 cgid) > > return cgrp; > > } > > > > +/** > > + * bpf_cgroup_stall - acquire the total stall time of cgroup > > + * @cgrp: cgroup struct > > + * @states: psi states > > + * > > + * Return the total stall time. > > + */ > > +__bpf_kfunc unsigned long bpf_cgroup_stall(struct cgroup *cgrp, > > + enum psi_states states) > > +{ > > + struct psi_group *group = cgroup_psi(cgrp); > > + > > + return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC); > > +} > > Can cgroup_psi() return NULL here? Looking at include/linux/psi.h, cgroup_psi() > returns cgrp->psi for non-root cgroups, which is allocated in > psi_cgroup_alloc() only when the psi_cgroups_enabled static branch is true. > > When the kernel boots with cgroup_disable=pressure or psi=0, cgrp->psi remains > NULL for all non-root cgroups. The sysfs *.pressure files are gated by > cgroup_psi_enabled() so they never hit this case, but these new kfuncs are > directly callable from BPF programs without such gating. > > A BPF program could obtain a valid non-root cgroup (e.g., via > bpf_cgroup_from_id) and call bpf_cgroup_stall on it. If PSI is disabled for > cgroups, this would dereference NULL at group->total[...]. 
> > Similarly for bpf_cgroup_flush_stats below, passing NULL to > psi_group_flush_stats would eventually reach __psi_group_flush_stats which > does mutex_lock(&NULL->avgs_lock). > > Should these functions check for NULL after calling cgroup_psi()? Good catch, I will fix that in the next version. Thanks! > A subsystem pattern also flags this as potentially concerning: the `states` > parameter is used directly as an array index without bounds checking. The > backing array is declared as u64 total[NR_PSI_AGGREGATORS][NR_PSI_STATES - 1] > (see include/linux/psi_types.h, struct psi_group), while enum psi_states has > the range 0..NR_PSI_STATES-1 (PSI_NONIDLE is the last member). > > The BPF verifier only checks that the register is SCALAR_VALUE and does not > validate enum ranges. Because the parameter is not marked __const, a BPF > program can pass any scalar, including values derived from map lookups or > arithmetic. > > Passing PSI_NONIDLE (a valid enum value) reads one element past the inner > array; passing a negative value (legal because enum is signed int in C) reads > memory before the array. This would leak adjacent memory contents to the BPF > program. > > Existing BPF cgroup-stat kfuncs guard against this pattern: > > bpf_mem_cgroup_page_state (mm/bpf_memcontrol.c): > if (unlikely(!memcg_stat_item_valid(idx))) > return (unsigned long)-1; > > bpf_mem_cgroup_memory_events (mm/bpf_memcontrol.c): > if (unlikely(event >= MEMCG_NR_MEMORY_EVENTS)) > return (unsigned long)-1; Good catch, I will add bounds checking, in a format similar to the above. Thanks! > Should this function add a similar check, such as: > if ((unsigned int)states >= NR_PSI_STATES - 1) > return -1; > > > + > > +/** > > + * bpf_cgroup_flush_stats - Flush cgroup's statistics > > + * @cgrp: cgroup struct > > + */ > > +__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp) > > +{ > > + struct psi_group *group = cgroup_psi(cgrp); > > + > > + psi_group_flush_stats(group); > > +} > > [ ... 
] > > > --- > AI reviewed your patch. Please fix the bug or email reply why it's not a bug. > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25285279315 ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20260503172520.376AAC2BCB4@smtp.kernel.org>]
* Re: [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function [not found] ` <20260503172520.376AAC2BCB4@smtp.kernel.org> @ 2026-05-06 12:55 ` Vernon Yang 0 siblings, 0 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-06 12:55 UTC (permalink / raw) To: sashiko Cc: bpf, Andrew Morton, David Hildenbrand (Red Hat), Lorenzo Stoakes (Oracle), Roman Gushchin, inwardvessel, Shakeel Butt, ast, Daniel Borkmann, surenb, linux-kernel, linux-mm, baohua, Lance Yang, Dev Jain Forwarding sashiko review On Mon, May 4, 2026 at 1:25 AM <sashiko-bot@kernel.org> wrote: > > Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider: > - [High] Missing bounds check on the `states` enum parameter in `bpf_cgroup_stall` allows out-of-bounds memory access. > - [High] Both `bpf_cgroup_stall` and `bpf_cgroup_flush_stats` unconditionally dereference the pointer returned by `cgroup_psi()`, which can be NULL. > - [High] Compilation fails when `CONFIG_CGROUPS=y` but `CONFIG_PSI=n`. > - [Medium] The `bpf_cgroup_stall` function truncates its 64-bit return value on 32-bit architectures. > -- > > commit 11e75919a882d61e2b11423881cab05305d77c18 > Author: Vernon Yang <yanglincheng@kylinos.cn> > > bpf: add bpf_cgroup_{flush_stats,stall} function > > Add bpf_cgroup_{flush_stats,stall} function to prepare for the > subsequent mthp_ext ebpf program. > > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > index 2bb60200c266..87f3072adce3 100644 > > --- a/kernel/bpf/helpers.c > > +++ b/kernel/bpf/helpers.c > > [ ... ] > > > @@ -2819,6 +2820,32 @@ __bpf_kfunc struct cgroup *bpf_cgroup_from_id(u64 cgid) > > return cgrp; > > } > > > > +/** > > + * bpf_cgroup_stall - acquire the total stall time of cgroup > > + * @cgrp: cgroup struct > > + * @states: psi states > > + * > > + * Return the total stall time. 
> > + */ > > +__bpf_kfunc unsigned long bpf_cgroup_stall(struct cgroup *cgrp, > > + enum psi_states states) > > Is it possible that div_u64() returns a 64-bit value that gets truncated > here on 32-bit architectures? Since unsigned long is 32 bits on those > systems, would changing the return type to u64 prevent the stall time > from wrapping around? Yes, I will fix that in the next version. Thanks! > One more thing regarding the build, what happens when CONFIG_CGROUPS is > enabled but CONFIG_PSI is disabled? Since struct psi_group might be defined > as an empty struct without a total field in include/linux/psi_types.h, and > cgroup_psi() wouldn't be declared, would it be better to wrap these kfuncs > and their BTF_ID_FLAGS in an #ifdef CONFIG_PSI block? Indeed, this is a problem. Compared to adding “#ifdef CONFIG_PSI”, I prefer adding follow, as it is clearer, Thanks! diff --git a/include/linux/psi.h b/include/linux/psi.h index 7b4fd8190810..243dcf97bea4 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -52,6 +52,10 @@ static inline void psi_memstall_enter(unsigned long *flags) {} static inline void psi_memstall_leave(unsigned long *flags) {} #ifdef CONFIG_CGROUPS +static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) +{ + return NULL; +} static inline int psi_cgroup_alloc(struct cgroup *cgrp) > > +{ > > + struct psi_group *group = cgroup_psi(cgrp); > > + > > + return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC); > > Could states be out of bounds here? Since the BPF verifier treats enum > parameters as unconstrained scalars, a BPF program might pass an arbitrary > or negative integer. Does this need a runtime bounds check like > states < 0 || states >= NR_PSI_STATES - 1 ? Good catch, I will fix that in the next version. Thanks! > Also, is it possible for group to be NULL? If cgroup PSI tracking is > disabled at runtime via the cgroup_disable=pressure boot parameter, > cgroup_psi(cgrp) can return NULL for non-root cgroups. 
Does this dereference > of group->total need a NULL check first? I will fix that in the next version. Thanks! > > +} > > + > > +/** > > + * bpf_cgroup_flush_stats - Flush cgroup's statistics > > + * @cgrp: cgroup struct > > + */ > > +__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp) > > +{ > > + struct psi_group *group = cgroup_psi(cgrp); > > + > > + psi_group_flush_stats(group); > > Similar to bpf_cgroup_stall(), if PSI tracking is disabled, group could > be NULL here. Since psi_group_flush_stats() might unconditionally acquire > a mutex, does this code need a NULL check on group to prevent a panic? I will fix that in the next version. Thanks! > > +} > > -- > Sashiko AI review · https://sashiko.dev/#/patchset/20260503165024.1526680-1-vernon2gm@gmail.com?part=2 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops 2026-05-03 16:50 [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang 2026-05-03 16:50 ` [PATCH 1/4] psi: add psi_group_flush_stats() function Vernon Yang 2026-05-03 16:50 ` [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang @ 2026-05-03 16:50 ` Vernon Yang 2026-05-03 17:35 ` bot+bpf-ci [not found] ` <20260503174125.2C949C2BCB4@smtp.kernel.org> 2026-05-03 16:50 ` [PATCH 4/4] samples: bpf: add mthp_ext Vernon Yang 3 siblings, 2 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-03 16:50 UTC (permalink / raw) To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb Cc: linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, Vernon Yang From: Vernon Yang <yanglincheng@kylinos.cn> Introducing bpf_mthp_ops enables eBPF programs to register the mthp_choose callback function via cgroup-ebpf. Using cgroup-bpf to customize mTHP size for different scenarios, automatically select different mTHP sizes for different cgroups, let's focus on making them truly transparent. 
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> --- MAINTAINERS | 3 + include/linux/bpf_huge_memory.h | 35 +++++++ include/linux/cgroup-defs.h | 1 + include/linux/huge_mm.h | 6 ++ mm/Kconfig | 14 +++ mm/Makefile | 1 + mm/bpf_huge_memory.c | 169 ++++++++++++++++++++++++++++++++ 7 files changed, 229 insertions(+) create mode 100644 include/linux/bpf_huge_memory.h create mode 100644 mm/bpf_huge_memory.c diff --git a/MAINTAINERS b/MAINTAINERS index 27a073f53cea..39f00676eeb7 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4887,7 +4887,10 @@ M: Shakeel Butt <shakeel.butt@linux.dev> L: bpf@vger.kernel.org L: linux-mm@kvack.org S: Maintained +F: include/linux/bpf_huge_memory.h +F: mm/bpf_huge_memory.c F: mm/bpf_memcontrol.c +F: samples/bpf/mthp_ext.* BPF [MISC] L: bpf@vger.kernel.org diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h new file mode 100644 index 000000000000..1c8a6f7ad8f1 --- /dev/null +++ b/include/linux/bpf_huge_memory.h @@ -0,0 +1,35 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ + +#ifndef __BPF_HUGE_MEMORY_H +#define __BPF_HUGE_MEMORY_H + +/** + * struct bpf_mthp_ops - BPF callbacks for mTHP operations + * @mthp_choose: Choose the custom mTHP orders + * + * This structure defines the interface for BPF programs to customize + * mTHP behavior through struct_ops programs. + */ +struct bpf_mthp_ops { + unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders); +}; + +#if defined(CONFIG_BPF_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BPF_SYSCALL) +/** + * bpf_mthp_choose: Choose the custom mTHP orders using bpf + * @mm: task mm_struct + * @orders: original orders + * + * Return suited mTHP orders. 
+ */ +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders); +#else +static inline unsigned long bpf_mthp_choose(struct mm_struct *mm, + unsigned long orders) +{ + return orders; +} +#endif /* CONFIG_BPF_SYSCALL */ + +#endif /* __BPF_HUGE_MEMORY_H */ + diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index f42563739d2e..78854d0e06ab 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -628,6 +628,7 @@ struct cgroup { #ifdef CONFIG_BPF_SYSCALL struct bpf_local_storage __rcu *bpf_cgrp_storage; + struct bpf_mthp_ops *mthp_ops; #endif #ifdef CONFIG_EXT_SUB_SCHED struct scx_sched __rcu *scx_sched; diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2949e5acff35..80ec622213df 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -3,6 +3,7 @@ #define _LINUX_HUGE_MM_H #include <linux/mm_types.h> +#include <linux/bpf_huge_memory.h> #include <linux/fs.h> /* only for vma_is_dax() */ #include <linux/kobject.h> @@ -291,6 +292,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, enum tva_type type, unsigned long orders) { + /* The eBPF-specified orders overrides which order is selected. */ + orders &= bpf_mthp_choose(vma->vm_mm, orders); + if (!orders) + return 0; + /* * Optimization to check if required orders are enabled early. Only * forced collapse ignores sysfs configs. diff --git a/mm/Kconfig b/mm/Kconfig index e8bf1e9e6ad9..12382431ddc7 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -963,6 +963,20 @@ config NO_PAGE_MAPCOUNT EXPERIMENTAL because the impact of some changes is still unclear. +config BPF_TRANSPARENT_HUGEPAGE + bool "BPF-based transparent hugepage (EXPERIMENTAL)" + depends on TRANSPARENT_HUGEPAGE + help + Using cgroup-bpf to customize mTHP size for different scenarios, + automatically select different mTHP sizes for different cgroups, + let's focus on making them truly transparent. 
+ + This is an experimental feature that might go away at any time. + Please do not rely on it in any production environment. + + EXPERIMENTAL because the BPF interface is unstable and may be removed + at any time. + endif # TRANSPARENT_HUGEPAGE # simple helper to make the code a bit easier to read diff --git a/mm/Makefile b/mm/Makefile index 8ad2ab08244e..b474c21c3253 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -108,6 +108,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o endif ifdef CONFIG_BPF_SYSCALL obj-$(CONFIG_MEMCG) += bpf_memcontrol.o +obj-$(CONFIG_BPF_TRANSPARENT_HUGEPAGE) += bpf_huge_memory.o endif obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o obj-$(CONFIG_GUP_TEST) += gup_test.o diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c new file mode 100644 index 000000000000..e34e0a35edac --- /dev/null +++ b/mm/bpf_huge_memory.c @@ -0,0 +1,169 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Huge memory related BPF code + * + * Author: Vernon Yang <yanglincheng@kylinos.cn> + */ + +#include <linux/bpf.h> +#include <linux/srcu.h> + +/* Protects cgrp->mthp_ops pointer for read and write.
*/ +DEFINE_SRCU(mthp_bpf_srcu); + +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders) +{ + struct cgroup *cgrp; + struct mem_cgroup *memcg; + struct bpf_mthp_ops *ops; + int idx; + + memcg = get_mem_cgroup_from_mm(mm); + if (!memcg) + return orders; + + cgrp = memcg->css.cgroup; + ops = READ_ONCE(cgrp->mthp_ops); + if (unlikely(ops)) { + idx = srcu_read_lock(&mthp_bpf_srcu); + if (ops->mthp_choose) + orders = ops->mthp_choose(cgrp, orders); + srcu_read_unlock(&mthp_bpf_srcu, idx); + } + + mem_cgroup_put(memcg); + + return orders; +} + +static int bpf_mthp_ops_btf_struct_access(struct bpf_verifier_log *log, + const struct bpf_reg_state *reg, int off, int size) +{ + return -EACCES; +} + +static bool bpf_mthp_ops_is_valid_access(int off, int size, enum bpf_access_type type, + const struct bpf_prog *prog, struct bpf_insn_access_aux *info) +{ + return bpf_tracing_btf_ctx_access(off, size, type, prog, info); +} + +const struct bpf_verifier_ops bpf_mthp_verifier_ops = { + .get_func_proto = bpf_base_func_proto, + .btf_struct_access = bpf_mthp_ops_btf_struct_access, + .is_valid_access = bpf_mthp_ops_is_valid_access, +}; + +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link) +{ + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link; + struct bpf_mthp_ops *ops = kdata; + struct cgroup *cgrp = st_link->cgroup; + struct cgroup_subsys_state *pos; + + /* The link is not yet fully initialized, but cgroup should be set */ + if (!link) + return -EOPNOTSUPP; + + cgroup_lock(); + css_for_each_descendant_pre(pos, &cgrp->self) { + struct cgroup *child = pos->cgroup; + + if (READ_ONCE(child->mthp_ops)) { + /* TODO + * Do not destroy the cgroup hierarchy property. + * If an eBPF program already exists in the sub-cgroup, + * trigger an error and clear the already set + * bpf_mthp_ops data. 
+ */ + continue; + } + WRITE_ONCE(child->mthp_ops, ops); + } + cgroup_unlock(); + + return 0; +} + +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link) +{ + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link; + struct bpf_mthp_ops *ops = kdata; + struct cgroup *cgrp = st_link->cgroup; + struct cgroup_subsys_state *pos; + + cgroup_lock(); + css_for_each_descendant_pre(pos, &cgrp->self) { + struct cgroup *child = pos->cgroup; + + if (READ_ONCE(child->mthp_ops) == ops) + WRITE_ONCE(child->mthp_ops, NULL); + } + cgroup_unlock(); + + synchronize_srcu(&mthp_bpf_srcu); +} + +static int bpf_mthp_ops_check_member(const struct btf_type *t, + const struct btf_member *member, + const struct bpf_prog *prog) +{ + u32 moff = __btf_member_bit_offset(t, member) / 8; + + switch (moff) { + case offsetof(struct bpf_mthp_ops, mthp_choose): + break; + default: + return -EINVAL; + } + + if (prog->sleepable) + return -EINVAL; + + return 0; +} + +static int bpf_mthp_ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + return 0; +} + +static int bpf_mthp_ops_init(struct btf *btf) +{ + return 0; +} + +static unsigned long cfi_mthp_choose(struct cgroup *cgrp, unsigned long orders) +{ + return 0; +} + +static struct bpf_mthp_ops cfi_bpf_mthp_ops = { + .mthp_choose = cfi_mthp_choose, +}; + +static struct bpf_struct_ops bso_bpf_mthp_ops = { + .verifier_ops = &bpf_mthp_verifier_ops, + .reg = bpf_mthp_ops_reg, + .unreg = bpf_mthp_ops_unreg, + .check_member = bpf_mthp_ops_check_member, + .init_member = bpf_mthp_ops_init_member, + .init = bpf_mthp_ops_init, + .name = "bpf_mthp_ops", + .owner = THIS_MODULE, + .cfi_stubs = &cfi_bpf_mthp_ops, +}; + +static int __init bpf_huge_memory_init(void) +{ + int err; + + err = register_bpf_struct_ops(&bso_bpf_mthp_ops, bpf_mthp_ops); + if (err) + pr_warn("Registration of bpf_mthp_ops failed, err %d\n", err); + + return err; +} +late_initcall(bpf_huge_memory_init); 
-- 2.53.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops 2026-05-03 16:50 ` [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang @ 2026-05-03 17:35 ` bot+bpf-ci 2026-05-06 13:06 ` Vernon Yang [not found] ` <20260503174125.2C949C2BCB4@smtp.kernel.org> 1 sibling, 1 reply; 14+ messages in thread From: bot+bpf-ci @ 2026-05-03 17:35 UTC (permalink / raw) To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb Cc: linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, yanglincheng, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 6024 bytes --] > diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h > new file mode 100644 > index 000000000000..1c8a6f7ad8f1 > --- /dev/null > +++ b/include/linux/bpf_huge_memory.h > @@ -0,0 +1,35 @@ > +/* SPDX-License-Identifier: GPL-2.0+ */ > + > +#ifndef __BPF_HUGE_MEMORY_H > +#define __BPF_HUGE_MEMORY_H > + > +/** > + * struct bpf_mthp_ops - BPF callbacks for mTHP operations > + * @mthp_choose: Choose the custom mTHP orders > + * > + * This structure defines the interface for BPF programs to customize > + * mTHP behavior through struct_ops programs. > + */ > +struct bpf_mthp_ops { > + unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders); > +}; > + > +#if defined(CONFIG_BPF_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BPF_SYSCALL) > +/** > + * bpf_mthp_choose: Choose the custom mTHP orders using bpf ^ The kernel-doc format for functions uses parentheses followed by a dash, not a colon. When scripts/kernel-doc processes this file, will it skip this function documentation? > + * @mm: task mm_struct > + * @orders: original orders > + * > + * Return suited mTHP orders. > + */ [ ... 
] > diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c > new file mode 100644 > index 000000000000..e34e0a35edac > --- /dev/null > +++ b/mm/bpf_huge_memory.c > @@ -0,0 +1,169 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Huge memory related BPF code > + * > + * Author: Vernon Yang <yanglincheng@kylinos.cn> > + */ > + > +#include <linux/bpf.h> > +#include <linux/srcu.h> > + > +/* Protects cgrp->mthp_ops pointer for read and write. */ > +DEFINE_SRCU(mthp_bpf_srcu); > + > +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders) > +{ > + struct cgroup *cgrp; > + struct mem_cgroup *memcg; > + struct bpf_mthp_ops *ops; > + int idx; > + > + memcg = get_mem_cgroup_from_mm(mm); > + if (!memcg) > + return orders; > + > + cgrp = memcg->css.cgroup; > + ops = READ_ONCE(cgrp->mthp_ops); ^^^^^ > + if (unlikely(ops)) { > + idx = srcu_read_lock(&mthp_bpf_srcu); ^^^^^^^^^^^^^^^^^ Could there be a use-after-free here? The pointer is read BEFORE entering the SRCU read-side critical section. If bpf_mthp_ops_unreg runs on another CPU between these two lines, synchronize_srcu returns immediately because no reader holds the lock yet. Race sequence: (1) Thread A loads ops (non-NULL) above (2) Thread A is preempted (3) Thread B calls bpf_mthp_ops_unreg, clears the pointer, calls synchronize_srcu which returns immediately (4) Thread B frees the backing memory (5) Thread A resumes, enters SRCU section, then dereferences ops->mthp_choose For SRCU to protect the pointer across its use, the reader must hold srcu_read_lock before reading the pointer. Would moving the srcu_read_lock before the READ_ONCE prevent this race? > + if (ops->mthp_choose) > + orders = ops->mthp_choose(cgrp, orders); > + srcu_read_unlock(&mthp_bpf_srcu, idx); > + } > + > + mem_cgroup_put(memcg); > + > + return orders; > +} [ ... 
] > +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link) > +{ > + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link; > + struct bpf_mthp_ops *ops = kdata; > + struct cgroup *cgrp = st_link->cgroup; ^^^^^^^^^^^^ Can this dereference NULL? When bpf_struct_ops_map_update_elem calls the reg callback at kernel/bpf/bpf_struct_ops.c:873, it passes link = NULL for maps without BPF_F_LINK. When link is NULL, st_link->cgroup dereferences NULL before the check at the next line runs. > + struct cgroup_subsys_state *pos; > + > + /* The link is not yet fully initialized, but cgroup should be set */ > + if (!link) > + return -EOPNOTSUPP; Does this comment match reality? Looking at bpf_struct_ops_link_create in kernel/bpf/bpf_struct_ops.c:1397, the link was fully initialized by bpf_link_init and bpf_link_prime before calling reg. Also, struct bpf_struct_ops_link does not have a cgroup field in its definition at kernel/bpf/bpf_struct_ops.c:58. Also, does struct bpf_struct_ops_link actually have a cgroup field? Its definition shows only { struct bpf_link link; struct bpf_map __rcu *map; wait_queue_head_t wait_hup; }. The patch does not add this field to the struct. Would this cause a compilation error? > + > + cgroup_lock(); > + css_for_each_descendant_pre(pos, &cgrp->self) { > + struct cgroup *child = pos->cgroup; > + > + if (READ_ONCE(child->mthp_ops)) { > + /* TODO ^^^^^ This isn't a bug, but per BPF comment style (subsystem/bpf.md), multi-line comments should have the opening /* on its own line in BPF-related files. > + * Do not destroy the cgroup hierarchy property. > + * If an eBPF program already exists in the sub-cgroup, > + * trigger an error and clear the already set > + * bpf_mthp_ops data. 
> + */ > + continue; > + } > + WRITE_ONCE(child->mthp_ops, ops); > + } > + cgroup_unlock(); > + > + return 0; > +} > + > +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link) > +{ > + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link; > + struct bpf_mthp_ops *ops = kdata; > + struct cgroup *cgrp = st_link->cgroup; ^^^^^^^^^^^^ Can this dereference NULL? Looking at bpf_struct_ops_map_delete_elem in kernel/bpf/bpf_struct_ops.c:936, it calls unreg with link = NULL. When link is NULL, st_link->cgroup dereferences NULL. > + struct cgroup_subsys_state *pos; > + > + cgroup_lock(); > + css_for_each_descendant_pre(pos, &cgrp->self) { > + struct cgroup *child = pos->cgroup; > + > + if (READ_ONCE(child->mthp_ops) == ops) > + WRITE_ONCE(child->mthp_ops, NULL); > + } > + cgroup_unlock(); > + > + synchronize_srcu(&mthp_bpf_srcu); > +} --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25285279315 ^ permalink raw reply [flat|nested] 14+ messages in thread
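For completeness, the reordering the bot asks about would look roughly as follows — a sketch only, written against this patch's own helpers, not a tested fix:

```c
	/* Sketch: take the SRCU read lock *before* loading the pointer,
	 * so a concurrent unreg's synchronize_srcu() must wait for this
	 * reader instead of returning immediately. */
	idx = srcu_read_lock(&mthp_bpf_srcu);
	ops = READ_ONCE(cgrp->mthp_ops);
	if (unlikely(ops) && ops->mthp_choose)
		orders = ops->mthp_choose(cgrp, orders);
	srcu_read_unlock(&mthp_bpf_srcu, idx);
```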
* Re: [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops 2026-05-03 17:35 ` bot+bpf-ci @ 2026-05-06 13:06 ` Vernon Yang 0 siblings, 0 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-06 13:06 UTC (permalink / raw) To: bot+bpf-ci Cc: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb, linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, yanglincheng, andrii, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai On Mon, May 4, 2026 at 1:35 AM <bot+bpf-ci@kernel.org> wrote: > > > diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h > > new file mode 100644 > > index 000000000000..1c8a6f7ad8f1 > > --- /dev/null > > +++ b/include/linux/bpf_huge_memory.h > > @@ -0,0 +1,35 @@ > > +/* SPDX-License-Identifier: GPL-2.0+ */ > > + > > +#ifndef __BPF_HUGE_MEMORY_H > > +#define __BPF_HUGE_MEMORY_H > > + > > +/** > > + * struct bpf_mthp_ops - BPF callbacks for mTHP operations > > + * @mthp_choose: Choose the custom mTHP orders > > + * > > + * This structure defines the interface for BPF programs to customize > > + * mTHP behavior through struct_ops programs. > > + */ > > +struct bpf_mthp_ops { > > + unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders); > > +}; > > + > > +#if defined(CONFIG_BPF_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BPF_SYSCALL) > > +/** > > + * bpf_mthp_choose: Choose the custom mTHP orders using bpf > ^ > The kernel-doc format for functions uses parentheses followed by a dash, not > a colon. When scripts/kernel-doc processes this file, will it skip this > function documentation? I will fix that in the next version. Thanks! > > + * @mm: task mm_struct > > + * @orders: original orders > > + * > > + * Return suited mTHP orders. > > + */ > > [ ... 
] > > > diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c > > new file mode 100644 > > index 000000000000..e34e0a35edac > > --- /dev/null > > +++ b/mm/bpf_huge_memory.c > > @@ -0,0 +1,169 @@ > > +// SPDX-License-Identifier: GPL-2.0-or-later > > +/* > > + * Huge memory related BPF code > > + * > > + * Author: Vernon Yang <yanglincheng@kylinos.cn> > > + */ > > + > > +#include <linux/bpf.h> > > +#include <linux/srcu.h> > > + > > +/* Protects cgrp->mthp_ops pointer for read and write. */ > > +DEFINE_SRCU(mthp_bpf_srcu); > > + > > +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders) > > +{ > > + struct cgroup *cgrp; > > + struct mem_cgroup *memcg; > > + struct bpf_mthp_ops *ops; > > + int idx; > > + > > + memcg = get_mem_cgroup_from_mm(mm); > > + if (!memcg) > > + return orders; > > + > > + cgrp = memcg->css.cgroup; > > + ops = READ_ONCE(cgrp->mthp_ops); > ^^^^^ > > + if (unlikely(ops)) { > > + idx = srcu_read_lock(&mthp_bpf_srcu); > ^^^^^^^^^^^^^^^^^ > Could there be a use-after-free here? The pointer is read BEFORE entering > the SRCU read-side critical section. If bpf_mthp_ops_unreg runs on another > CPU between these two lines, synchronize_srcu returns immediately because > no reader holds the lock yet. > > Race sequence: > (1) Thread A loads ops (non-NULL) above > (2) Thread A is preempted > (3) Thread B calls bpf_mthp_ops_unreg, clears the pointer, calls > synchronize_srcu which returns immediately > (4) Thread B frees the backing memory > (5) Thread A resumes, enters SRCU section, then dereferences ops->mthp_choose > > For SRCU to protect the pointer across its use, the reader must hold > srcu_read_lock before reading the pointer. Would moving the srcu_read_lock > before the READ_ONCE prevent this race? Good catch, I will fix that in the next version. Thanks! 
> > + if (ops->mthp_choose) > > + orders = ops->mthp_choose(cgrp, orders); > > + srcu_read_unlock(&mthp_bpf_srcu, idx); > > + } > > + > > + mem_cgroup_put(memcg); > > + > > + return orders; > > +} > > [ ... ] > > > +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link) > > +{ > > + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link; > > + struct bpf_mthp_ops *ops = kdata; > > + struct cgroup *cgrp = st_link->cgroup; > ^^^^^^^^^^^^ > Can this dereference NULL? When bpf_struct_ops_map_update_elem calls the > reg callback at kernel/bpf/bpf_struct_ops.c:873, it passes link = NULL for > maps without BPF_F_LINK. When link is NULL, st_link->cgroup dereferences > NULL before the check at the next line runs. Yeah, I will fix that in the next version. Thanks! > > + struct cgroup_subsys_state *pos; > > + > > + /* The link is not yet fully initialized, but cgroup should be set */ > > + if (!link) > > + return -EOPNOTSUPP; > > Does this comment match reality? Looking at bpf_struct_ops_link_create in > kernel/bpf/bpf_struct_ops.c:1397, the link was fully initialized by > bpf_link_init and bpf_link_prime before calling reg. Also, struct > bpf_struct_ops_link does not have a cgroup field in its definition at > kernel/bpf/bpf_struct_ops.c:58. > > Also, does struct bpf_struct_ops_link actually have a cgroup field? Its > definition shows only { struct bpf_link link; struct bpf_map __rcu *map; > wait_queue_head_t wait_hup; }. The patch does not add this field to the > struct. Would this cause a compilation error? This series is based on linux v7.1-rc1 (26fd6bff2c05) + "mm: BPF OOM"[3] first four patches. All of the above definitions are in "mm: BPF OOM" PATCH#2. 
https://lore.kernel.org/linux-mm/20260127024421.494929-3-roman.gushchin@linux.dev/ > > + > > + cgroup_lock(); > > + css_for_each_descendant_pre(pos, &cgrp->self) { > > + struct cgroup *child = pos->cgroup; > > + > > + if (READ_ONCE(child->mthp_ops)) { > > + /* TODO > ^^^^^ > This isn't a bug, but per BPF comment style (subsystem/bpf.md), multi-line > comments should have the opening /* on its own line in BPF-related files. > > > + * Do not destroy the cgroup hierarchy property. > > + * If an eBPF program already exists in the sub-cgroup, > > + * trigger an error and clear the already set > > + * bpf_mthp_ops data. > > + */ > > + continue; > > + } > > + WRITE_ONCE(child->mthp_ops, ops); > > + } > > + cgroup_unlock(); > > + > > + return 0; > > +} > > + > > +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link) > > +{ > > + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link; > > + struct bpf_mthp_ops *ops = kdata; > > + struct cgroup *cgrp = st_link->cgroup; > ^^^^^^^^^^^^ > Can this dereference NULL? Looking at bpf_struct_ops_map_delete_elem in > kernel/bpf/bpf_struct_ops.c:936, it calls unreg with link = NULL. When > link is NULL, st_link->cgroup dereferences NULL. Yeah, I will fix that in the next version. Thanks! > > + struct cgroup_subsys_state *pos; > > + > > + cgroup_lock(); > > + css_for_each_descendant_pre(pos, &cgrp->self) { > > + struct cgroup *child = pos->cgroup; > > + > > + if (READ_ONCE(child->mthp_ops) == ops) > > + WRITE_ONCE(child->mthp_ops, NULL); > > + } > > + cgroup_unlock(); > > + > > + synchronize_srcu(&mthp_bpf_srcu); > > +} > > > --- > AI reviewed your patch. Please fix the bug or email reply why it's not a bug. > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25285279315 ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20260503174125.2C949C2BCB4@smtp.kernel.org>]
* Re: [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops [not found] ` <20260503174125.2C949C2BCB4@smtp.kernel.org> @ 2026-05-06 13:26 ` Vernon Yang 0 siblings, 0 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-06 13:26 UTC (permalink / raw) To: sashiko Cc: bpf, Andrew Morton, David Hildenbrand (Red Hat), Lorenzo Stoakes (Oracle), Roman Gushchin, inwardvessel, Shakeel Butt, ast, Daniel Borkmann, surenb, linux-kernel, linux-mm, baohua, Lance Yang, Dev Jain Forwarding sashiko review On Mon, May 4, 2026 at 1:41 AM <sashiko-bot@kernel.org> wrote: > > Thank you for your contribution! Sashiko AI review found 6 potential issue(s) to consider: > - [Critical] Use-After-Free vulnerability in `bpf_mthp_choose()` due to improper SRCU locking. > - [High] NULL pointer dereference of `link` before validation in `bpf_mthp_ops_reg()` and `bpf_mthp_ops_unreg()`. > - [High] Newly created cgroups silently bypass the hierarchical BPF mTHP policy. > - [High] Silent registration failure leaves the cgroup hierarchy in an inconsistent state. > - [Medium] The BPF program can only restrict allowable THP orders, not expand or select new ones. > - [Low] Compilation failure due to dereferencing an incomplete type and a non-existent struct member. > -- > > commit e5c9c74b11f76b4de7ba4ef8ff94c10bae978e16 > Author: Vernon Yang <yanglincheng@kylinos.cn> > > mm: introduce bpf_mthp_ops struct ops > > Introducing bpf_mthp_ops enables eBPF programs to register the > mthp_choose callback function via cgroup-ebpf. > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h > > index 2949e5acff35..80ec622213df 100644 > > --- a/include/linux/huge_mm.h > > +++ b/include/linux/huge_mm.h > [ ... ] > > @@ -291,6 +292,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma, > > enum tva_type type, > > unsigned long orders) > > { > > + /* The eBPF-specified orders overrides which order is selected. 
*/ > > + orders &= bpf_mthp_choose(vma->vm_mm, orders); > > Does this inadvertently prevent the BPF program from enabling or selecting new > orders? Since the return value is applied using a bitwise AND, it appears > restricted to only disabling existing orders rather than customizing them as > described in the commit message. Yes, it is currently restricted to only disabling existing orders. > > + if (!orders) > > + return 0; > > + > > /* > > * Optimization to check if required orders are enabled early. Only > > * forced collapse ignores sysfs configs. > > diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c > > new file mode 100644 > > index 000000000000..e34e0a35edac > > --- /dev/null > > +++ b/mm/bpf_huge_memory.c > [ ...
] > > +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link) > > +{ > > + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link; > > + struct bpf_mthp_ops *ops = kdata; > > + struct cgroup *cgrp = st_link->cgroup; > > Can this cause a regression if link is passed as NULL? st_link is cast from > link, and st_link->cgroup implicitly dereferences it before the validation > check below. (Report is identical to bot+bpf-ci) I will fix that in the next version. Thanks! > Additionally, struct bpf_struct_ops_link is defined privately in > kernel/bpf/bpf_struct_ops.c and does not contain a cgroup field. Will this > cause a compilation failure due to dereferencing an incomplete type and a > non-existent struct member? (Report is identical to bot+bpf-ci) This series is based on linux v7.1-rc1 (26fd6bff2c05) + "mm: BPF OOM"[3] first four patches. All of the above definitions are in "mm: BPF OOM" PATCH#2. https://lore.kernel.org/linux-mm/20260127024421.494929-3-roman.gushchin@linux.dev/ > > + struct cgroup_subsys_state *pos; > > + > > + /* The link is not yet fully initialized, but cgroup should be set */ > > + if (!link) > > + return -EOPNOTSUPP; > > + > > + cgroup_lock(); > > + css_for_each_descendant_pre(pos, &cgrp->self) { > > + struct cgroup *child = pos->cgroup; > > + > > + if (READ_ONCE(child->mthp_ops)) { > > + /* TODO > > + * Do not destroy the cgroup hierarchy property. > > + * If an eBPF program already exists in the sub-cgroup, > > + * trigger an error and clear the already set > > + * bpf_mthp_ops data. > > + */ > > + continue; > > Is it intentional to return 0 at the end of this function when we skip over > descendants that already have mthp_ops configured? This appears to leave the > hierarchy in an inconsistent state while userspace assumes the registration > was fully successful. Yes, that is exactly what I will do next. > Furthermore, is there a missing hook in the cgroup creation path? 
Since this > loop only applies to existing descendants, any newly created cgroups will > have their mthp_ops field initialized to NULL, allowing tasks moved there to > silently bypass the BPF program attached to their ancestors. Good catch, I will set WRITE_ONCE(child->mthp_ops, parent->mthp_ops) during cgroup_create(). Thanks! > > + } > > + WRITE_ONCE(child->mthp_ops, ops); > > + } > > + cgroup_unlock(); > > + > > + return 0; > > +} > > + > > +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link) > > +{ > > + struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link; > > + struct bpf_mthp_ops *ops = kdata; > > + struct cgroup *cgrp = st_link->cgroup; > > Does this code similarly dereference link before it can be validated? > > -- > Sashiko AI review · https://sashiko.dev/#/patchset/20260503165024.1526680-1-vernon2gm@gmail.com?part=3 ^ permalink raw reply [flat|nested] 14+ messages in thread
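For reference, the inheritance Vernon proposes for newly created cgroups would presumably look something like this in cgroup_create() — a hypothetical placement, untested:

```c
	/* Hypothetical sketch: copy the parent's policy when a child
	 * cgroup is created, so new descendants do not silently bypass
	 * a bpf_mthp_ops attached to an ancestor. */
#ifdef CONFIG_BPF_TRANSPARENT_HUGEPAGE
	WRITE_ONCE(cgrp->mthp_ops, READ_ONCE(parent->mthp_ops));
#endif
```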
* [PATCH 4/4] samples: bpf: add mthp_ext 2026-05-03 16:50 [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang ` (2 preceding siblings ...) 2026-05-03 16:50 ` [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang @ 2026-05-03 16:50 ` Vernon Yang 2026-05-03 17:35 ` bot+bpf-ci [not found] ` <20260503175737.6190AC2BCB4@smtp.kernel.org> 3 siblings, 2 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-03 16:50 UTC (permalink / raw) To: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb Cc: linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, Vernon Yang From: Vernon Yang <yanglincheng@kylinos.cn> The mthp_ext sample is designed to address real workload issues. Its main functions are as follows: - When a sub-cgroup is under high memory pressure (default: 100ms of full stall within a 1s window), it will automatically fall back to using 4KB. - When the anon+shmem memory usage of a sub-cgroup falls below the minimum memory (default 16MB), small-memory processes will automatically fall back to using 4KB. - Under normal conditions, when there is no memory pressure and the anon+shmem memory usage exceeds the minimum memory, all mTHP sizes shall be utilized by the kernel. - Monitor the root-cgroup (/sys/fs/cgroup) directory by default, with support for specifying any cgroup directory.
Signed-off-by: Vernon Yang <yanglincheng@kylinos.cn> --- samples/bpf/.gitignore | 1 + samples/bpf/Makefile | 7 +- samples/bpf/mthp_ext.bpf.c | 142 ++++++++++++++++ samples/bpf/mthp_ext.c | 340 +++++++++++++++++++++++++++++++++++++ samples/bpf/mthp_ext.h | 30 ++++ 5 files changed, 519 insertions(+), 1 deletion(-) create mode 100644 samples/bpf/mthp_ext.bpf.c create mode 100644 samples/bpf/mthp_ext.c create mode 100644 samples/bpf/mthp_ext.h diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore index 0002cd359fb1..2a73581876b4 100644 --- a/samples/bpf/.gitignore +++ b/samples/bpf/.gitignore @@ -49,3 +49,4 @@ iperf.* /vmlinux.h /bpftool/ /libbpf/ +mthp_ext diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 95a4fa1f1e44..357c7d1c45ef 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -37,6 +37,7 @@ tprogs-y += xdp_fwd tprogs-y += task_fd_query tprogs-y += ibumad tprogs-y += hbm +tprogs-y += mthp_ext # Libbpf dependencies LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf @@ -122,6 +123,7 @@ always-y += task_fd_query_kern.o always-y += ibumad_kern.o always-y += hbm_out_kern.o always-y += hbm_edt_kern.o +always-y += mthp_ext.bpf.o COMMON_CFLAGS = $(TPROGS_USER_CFLAGS) TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS) @@ -289,6 +291,8 @@ $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h $(obj)/hbm.o: $(src)/hbm.h $(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h +mthp_ext: $(obj)/mthp_ext.skel.h + # Override includes for xdp_sample_user.o because $(srctree)/usr/include in # TPROGS_CFLAGS causes conflicts XDP_SAMPLE_CFLAGS += -Wall -O2 \ @@ -347,10 +351,11 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x -I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \ -c $(filter %.bpf.c,$^) -o $@ -LINKED_SKELS := xdp_router_ipv4.skel.h +LINKED_SKELS := xdp_router_ipv4.skel.h mthp_ext.skel.h clean-files += $(LINKED_SKELS) xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o +mthp_ext.skel.h-deps := mthp_ext.bpf.o LINKED_BPF_SRCS 
:= $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps))) diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c new file mode 100644 index 000000000000..bbee3e9f679c --- /dev/null +++ b/samples/bpf/mthp_ext.bpf.c @@ -0,0 +1,142 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include "vmlinux.h" +#include "mthp_ext.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_core_read.h> +#include <vdso/bits.h> + +struct mem_info { + unsigned long stall; + unsigned int order; +}; + +struct { + __uint(type, BPF_MAP_TYPE_CGRP_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct mem_info); +} cgrp_storage SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_RINGBUF); + __uint(max_entries, 256 * 1024); +} events SEC(".maps"); + +struct config_local configs; + +/* + * mthp_choose_impl: Choose the custom mTHP orders; read the order from + * cgrp_storage, which is adjusted by cgroup_scan(). + * @cgrp: control group + * @orders: original orders + * + * Return suited mTHP orders. + */ +SEC("struct_ops/mthp_choose") +unsigned long BPF_PROG(mthp_choose_impl, struct cgroup *cgrp, unsigned long orders) +{ + struct mem_info *info; + unsigned int order; + + if (configs.fixed) { + order = configs.init_order; + goto out; + } + + info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0, 0); + if (!info) + return orders; + + order = info->order; +out: + if (!order) + return 0; + + orders &= BIT(order + 1) - 1; + return orders; +} + +SEC(".struct_ops.link") +struct bpf_mthp_ops mthp_ops = { + .mthp_choose = (void *)mthp_choose_impl, +}; + +/* backport from kernel/cgroup/cgroup.c */ +static bool cgroup_has_tasks(struct cgroup *cgrp) +{ + return cgrp->nr_populated_csets; +} + +/* + * cgroup_scan: scan all descendant cgroups under root cgroup. + * + * 1.
When the memory usage of the sub-cgroup falls below the <min> threshold, + * it will automatically fall back to using 4KB size; otherwise, it will + * use all mTHP sizes. + * 2. When memory.pressure stall time of the sub-cgroup exceeds <threshold>, + * it will automatically fall back to using 4KB size; otherwise, it will + * use all mTHP sizes. + * + * Return 1 indicates termination of the iteration loop, and return 0 indicates + * iteration to the next sub-cgroup. + */ +SEC("iter.s/cgroup") +int cgroup_scan(struct bpf_iter__cgroup *ctx) +{ + struct cgroup *cgrp = ctx->cgroup; + struct mem_cgroup *memcg; + struct mem_info *info; + struct alert_event *e; + unsigned long curr_stall; + unsigned long curr_mem; + unsigned long delta; + + if (!cgrp) + return 1; + + if (!cgroup_has_tasks(cgrp)) + return 0; + + info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!info) + return 0; + + memcg = bpf_get_mem_cgroup(&cgrp->self); + if (!memcg) + return 0; + + bpf_cgroup_flush_stats(cgrp); + curr_stall = bpf_cgroup_stall(cgrp, PSI_MEM_FULL); + delta = curr_stall - info->stall; + bpf_mem_cgroup_flush_stats(memcg); + curr_mem = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED) + + bpf_mem_cgroup_page_state(memcg, NR_SHMEM); + if (curr_mem < FROM_MB(configs.min_mem) || delta >= configs.threshold) + info->order = 0; + else + info->order = PMD_ORDER; + + if (configs.debug) { + e = bpf_ringbuf_reserve(&events, sizeof(*e), 0); + if (e) { + e->prev_stall = info->stall; + e->curr_stall = curr_stall; + e->delta = delta; + e->mem = curr_mem; + e->order = info->order; + bpf_probe_read_kernel_str(e->name, sizeof(e->name), + cgrp->kn->name); + bpf_ringbuf_submit(e, 0); + } + } + + info->stall = curr_stall; + bpf_put_mem_cgroup(memcg); + + return 0; +} + +char LICENSE[] SEC("license") = "GPL"; diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c new file mode 100644 index 000000000000..0e064bad136f --- /dev/null +++ b/samples/bpf/mthp_ext.c @@ 
-0,0 +1,340 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <string.h> +#include <errno.h> +#include <fcntl.h> +#include <unistd.h> +#include <signal.h> +#include <time.h> +#include <stdbool.h> +#include <getopt.h> +#include <sys/epoll.h> +#include <linux/limits.h> +#include <sys/stat.h> +#include <linux/bpf.h> +#include <bpf/bpf.h> +#include <bpf/libbpf.h> +#include "mthp_ext.h" +#include "mthp_ext.skel.h" + +#define DEFAULT_ROOT "/sys/fs/cgroup" +#define DEFAULT_THRESHOLD_MS 100UL +#define DEFAULT_INTERVAL_MS 1000UL +#define DEFAULT_ORDER PMD_ORDER +#define DEFAULT_MIN_MEM 16 + +static bool exiting; + +static void usage(const char *name) +{ + fprintf(stderr, + "Usage: %s [OPTIONS]\n\n" + "Monitor specified cgroup, adjust mTHP size via cgroup_bpf.\n\n" + "Currently supports fixed mTHP size and automatic mTHP size adjustment.\n" + "By default, it monitors the entire cgroup and automatically\n" + "adjusts mTHP size within the specified time window <interval>.\n" + "1. When the memory size of the sub-cgroup falls below\n" + " the <min> threshold, it will automatically fall back to\n" + " using 4KB size; otherwise, it will use all mTHP sizes.\n" + "2. 
When memory.pressure stall time of the sub-cgroup exceeds\n" + " <threshold>, it will automatically fall back to using 4KB\n" + " size; otherwise, it will use all mTHP sizes.\n\n" + "Options:\n" + " -r, --root=PATH Root cgroup path (default: /sys/fs/cgroup)\n" + " -t, --threshold=MS threshold in ms (default: %lu)\n" + " -i, --interval=MS interval in ms (default: %lu)\n" + " -o, --order=NR Initial mthp order (default: %d)\n" + " -m, --min=MB Minimum memory size for mTHP (default: %d)\n" + " -f, --fixed Use fixed order, disable auto-adjustment\n" + " -d, --debug Enable debug output\n" + " -h, --help Show this help\n", + name, DEFAULT_THRESHOLD_MS, DEFAULT_INTERVAL_MS, DEFAULT_ORDER, + DEFAULT_MIN_MEM); +} + +static void sig_handler(int sig) +{ + exiting = true; +} + +static int setup_psi_trigger(const char *cgroup_path, const char *type, + unsigned long stall_us, unsigned long window_us) +{ + char path[PATH_MAX]; + char trigger[128]; + int fd, nr; + + snprintf(path, sizeof(path), "%s/memory.pressure", cgroup_path); + fd = open(path, O_RDWR | O_NONBLOCK); + if (fd < 0) { + fprintf(stderr, "ERROR: open PSI file failed\n"); + return -errno; + } + + nr = snprintf(trigger, sizeof(trigger), "%s %lu %lu", + type, stall_us, window_us); + if (write(fd, trigger, nr) < 0) { + fprintf(stderr, "ERROR: write PSI trigger failed\n"); + close(fd); + return -errno; + } + + return fd; +} + +static int trigger_scan(struct bpf_link *iter_link) +{ + char buf[256]; + int fd; + + fd = bpf_iter_create(bpf_link__fd(iter_link)); + if (fd < 0) { + fprintf(stderr, "ERROR: bpf_iter_create failed: %s\n", + strerror(errno)); + return -1; + } + + /* Read to trigger the iter program execution */ + while (read(fd, buf, sizeof(buf))) + ; + + close(fd); + return 0; +} + +static void *monitor_thread(int psi_fd, struct config_local *configs, + struct bpf_link *iter_link, struct ring_buffer *rb) +{ + struct epoll_event e; + int epoll_fd; + int nfds; + + epoll_fd = epoll_create1(0); + if (epoll_fd < 0) { + 
fprintf(stderr, "ERROR: epoll_create1 failed\n"); + return NULL; + } + + e.events = EPOLLPRI; + e.data.fd = psi_fd; + if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, psi_fd, &e)) { + fprintf(stderr, "ERROR: epoll_ctl failed\n"); + goto CLOSE; + } + + /* First initialization */ + trigger_scan(iter_link); + if (configs->debug) + ring_buffer__poll(rb, 0); + + /* Auto adjustment */ + while (!exiting) { + nfds = epoll_wait(epoll_fd, &e, 1, configs->interval); + trigger_scan(iter_link); + + if (configs->debug) { + printf("PSI: memory pressure %s\n", nfds ? "high" : "low"); + ring_buffer__poll(rb, 0); + } + } + +CLOSE: + close(epoll_fd); + return NULL; +} + +static int handle_event(void *ctx, void *data, size_t len) +{ + struct alert_event *e = data; + + printf("cgroup %s: stall %lu -> %lu (+%lu), mem %luMB, mthp order=%d\n", + e->name[0] ? e->name : "/", + e->prev_stall, e->curr_stall, e->delta, TO_MB(e->mem), e->order); + + return 0; +} + +int main(int argc, char **argv) +{ + const char *root_path = DEFAULT_ROOT; + unsigned long threshold = DEFAULT_THRESHOLD_MS; + unsigned long interval = DEFAULT_INTERVAL_MS; + unsigned int init_order = DEFAULT_ORDER; + unsigned int min_mem = DEFAULT_MIN_MEM; + bool fixed = false; + bool debug = false; + struct mthp_ext *skel; + struct bpf_link *iter_link; + struct bpf_link *ops_link; + struct ring_buffer *rb; + int root_fd; + int psi_fd; + int err = 0; + int opt; + + static struct option long_options[] = { + {"root", required_argument, 0, 'r'}, + {"threshold", required_argument, 0, 't'}, + {"interval", required_argument, 0, 'i'}, + {"order", required_argument, 0, 'o'}, + {"min", required_argument, 0, 'm'}, + {"fixed", no_argument, 0, 'f'}, + {"debug", no_argument, 0, 'd'}, + {"help", no_argument, 0, 'h'}, + {0, 0, 0, 0} + }; + + while ((opt = getopt_long(argc, argv, "r:t:i:o:m:fdh", + long_options, NULL)) != -1) { + switch (opt) { + case 'r': + root_path = optarg; + break; + case 't': + threshold = strtoul(optarg, NULL, 10); + break; + case 
'i': + interval = strtoul(optarg, NULL, 10); + break; + case 'o': + init_order = min(strtoul(optarg, NULL, 10), PMD_ORDER); + break; + case 'm': + min_mem = strtoul(optarg, NULL, 10); + break; + case 'f': + fixed = true; + break; + case 'd': + debug = true; + break; + case 'h': + usage(argv[0]); + return 0; + default: + usage(argv[0]); + return -EINVAL; + } + } + + if (!threshold || !interval) { + fprintf(stderr, "ERROR: threshold and interval must be > 0\n"); + usage(argv[0]); + return -EINVAL; + } + + signal(SIGINT, sig_handler); + signal(SIGTERM, sig_handler); + + root_fd = open(root_path, O_RDONLY); + if (root_fd < 0) { + fprintf(stderr, "ERROR: open '%s' failed: %s\n", + root_path, strerror(errno)); + return -errno; + } + + skel = mthp_ext__open(); + if (!skel) { + fprintf(stderr, "ERROR: failed to open BPF skeleton\n"); + err = -ENOMEM; + goto open_skel_fail; + } + + skel->bss->configs.threshold = threshold; + skel->bss->configs.interval = interval; + skel->bss->configs.init_order = init_order; + skel->bss->configs.min_mem = min_mem; + skel->bss->configs.fixed = fixed; + skel->bss->configs.debug = debug; + + err = mthp_ext__load(skel); + if (err) { + fprintf(stderr, "ERROR: failed to load BPF program: %d\n", err); + goto load_skel_fail; + } + + /* Attach struct_ops to root cgroup for mthp_choose */ + DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts); + opts.flags = BPF_F_CGROUP_FD; + opts.target_fd = root_fd; + ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts); + err = libbpf_get_error(ops_link); + if (err) { + fprintf(stderr, "ERROR: attach struct_ops failed: %d\n", err); + ops_link = NULL; + goto attach_opts_fail; + } + + printf("Monitoring : %s\n" + "threshold : %lums\n" + "Interval : %lums\n" + "Initial order : %d%s\n" + "min memory : %dMB\n" + "Debug : %s\n" + "Press Ctrl+C to exit.\n\n", + root_path, threshold, interval, init_order, + fixed ? " (fixed)" : " (auto)", min_mem, + debug ? 
"on" : "off"); + + if (fixed) { + while (!exiting) + usleep(interval * 1000); + goto exit_fixed; + } + + /* Auto adjustment, attach cgroup iter for scanning root + descendants */ + DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, iter_opts); + union bpf_iter_link_info linfo = { + .cgroup.cgroup_fd = root_fd, + .cgroup.order = BPF_CGROUP_ITER_DESCENDANTS_PRE, + }; + iter_opts.link_info = &linfo; + iter_opts.link_info_len = sizeof(linfo); + iter_link = bpf_program__attach_iter(skel->progs.cgroup_scan, &iter_opts); + err = libbpf_get_error(iter_link); + if (err) { + fprintf(stderr, "ERROR: attach cgroup iter failed: %d\n", err); + iter_link = NULL; + goto attach_iter_fail; + } + + /* Set up ring buffer for receiving alerts */ + rb = ring_buffer__new(bpf_map__fd(skel->maps.events), + handle_event, NULL, NULL); + if (!rb) { + fprintf(stderr, "ERROR: failed to create ring buffer\n"); + err = -ENOMEM; + goto rb_fail; + } + + + psi_fd = setup_psi_trigger(root_path, "some", threshold * 1000, + interval * 1000); + if (psi_fd < 0) { + fprintf(stderr, "ERROR: PSI trigger setup failed\n"); + goto psi_setup_fail; + } + + monitor_thread(psi_fd, &skel->bss->configs, iter_link, rb); + + close(psi_fd); +psi_setup_fail: + ring_buffer__free(rb); +rb_fail: + bpf_link__destroy(iter_link); +exit_fixed: +attach_iter_fail: + bpf_link__destroy(ops_link); +attach_opts_fail: +load_skel_fail: + mthp_ext__destroy(skel); +open_skel_fail: + close(root_fd); + + printf("\nExiting...\n"); + + return err; +} diff --git a/samples/bpf/mthp_ext.h b/samples/bpf/mthp_ext.h new file mode 100644 index 000000000000..33dc01bcebd3 --- /dev/null +++ b/samples/bpf/mthp_ext.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __MTHP_EXT_H__ +#define __MTHP_EXT_H__ + +#define CGROUP_NAME_LEN 128 +#define PMD_ORDER 9 +#define min(a, b) ((a) < (b) ? 
a : b) +#define FROM_MB(s) (s * 1024 * 1024) +#define TO_MB(s) (s / 1024 / 1024) + +struct config_local { + unsigned long threshold; + unsigned long interval; + unsigned int init_order; + unsigned int min_mem; + bool fixed; + bool debug; +}; + +struct alert_event { + unsigned long prev_stall; + unsigned long curr_stall; + unsigned long delta; + unsigned long mem; + unsigned int order; + char name[CGROUP_NAME_LEN]; +}; + +#endif /* __MTHP_EXT_H__ */ -- 2.53.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] samples: bpf: add mthp_ext 2026-05-03 16:50 ` [PATCH 4/4] samples: bpf: add mthp_ext Vernon Yang @ 2026-05-03 17:35 ` bot+bpf-ci 2026-05-06 13:30 ` Vernon Yang [not found] ` <20260503175737.6190AC2BCB4@smtp.kernel.org> 1 sibling, 1 reply; 14+ messages in thread From: bot+bpf-ci @ 2026-05-03 17:35 UTC (permalink / raw) To: vernon2gm, akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb Cc: linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, yanglincheng, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 3740 bytes --] > diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore > index 0002cd359fb1..2a73581876b4 100644 > --- a/samples/bpf/.gitignore > +++ b/samples/bpf/.gitignore [ ... ] > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile > index 95a4fa1f1e44..357c7d1c45ef 100644 > --- a/samples/bpf/Makefile > +++ b/samples/bpf/Makefile [ ... ] > diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c > new file mode 100644 > index 000000000000..bbee3e9f679c [ ... ] > diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c > new file mode 100644 > index 000000000000..0e064bad136f > --- /dev/null > +++ b/samples/bpf/mthp_ext.c > @@ -0,0 +1,340 @@ > +// SPDX-License-Identifier: GPL-2.0 [ ... ] > +int main(int argc, char **argv) > +{ > + const char *root_path = DEFAULT_ROOT; > + unsigned long threshold = DEFAULT_THRESHOLD_MS; > + unsigned long interval = DEFAULT_INTERVAL_MS; > + unsigned int init_order = DEFAULT_ORDER; > + unsigned int min_mem = DEFAULT_MIN_MEM; > + bool fixed = false; > + bool debug = false; > + struct mthp_ext *skel; > + struct bpf_link *iter_link; > + struct bpf_link *ops_link; > + struct ring_buffer *rb; > + int root_fd; > + int psi_fd; > + int err = 0; > + int opt; [ ... 
] > + /* Attach struct_ops to root cgroup for mthp_choose */ > + DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts); > + opts.flags = BPF_F_CGROUP_FD; > + opts.target_fd = root_fd; > + ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts); ^^^^^^^^^^^^^^^^^^^^^^^^^^ Will mthp_ext compile with these symbols? A tree-wide grep shows that bpf_struct_ops_opts, BPF_F_CGROUP_FD, and bpf_map__attach_struct_ops_opts() do not exist anywhere in the kernel source tree: $ git grep -n "struct bpf_struct_ops_opts" samples/bpf/mthp_ext.c:260: DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts); $ git grep -n "BPF_F_CGROUP_FD" samples/bpf/mthp_ext.c:261: opts.flags = BPF_F_CGROUP_FD; $ git grep -n "bpf_map__attach_struct_ops_opts" samples/bpf/mthp_ext.c:263: ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts); The struct bpf_struct_ops_opts type is not defined in tools/lib/bpf/libbpf.h. The BPF_F_CGROUP_FD flag does not appear in include/uapi/linux/bpf.h or any other header. The bpf_map__attach_struct_ops_opts() function is not declared in libbpf.h or exported in tools/lib/bpf/libbpf.map. The only existing libbpf API for struct_ops attachment is: tools/lib/bpf/libbpf.h: LIBBPF_API struct bpf_link * bpf_map__attach_struct_ops(const struct bpf_map *map); which takes no opts parameter and cannot attach to a specific cgroup fd. Samples/bpf/Makefile was modified earlier in this commit to add mthp_ext to tprogs-y, meaning the build system will attempt to compile mthp_ext.c during make samples/bpf/. That compilation will fail with undefined type, undefined identifier, and implicit declaration errors. This looks like it depends on an out-of-tree libbpf extension that adds a per-cgroup struct_ops attachment API. Is that libbpf patch supposed to be included earlier in this series, or should the sample use a different attach mechanism? 
> + err = libbpf_get_error(ops_link); > + if (err) { > + fprintf(stderr, "ERROR: attach struct_ops failed: %d\n", err); > + ops_link = NULL; > + goto attach_opts_fail; > + } [ ... ] > diff --git a/samples/bpf/mthp_ext.h b/samples/bpf/mthp_ext.h > new file mode 100644 > index 000000000000..33dc01bcebd3 [ ... ] --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25285279315 ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 4/4] samples: bpf: add mthp_ext 2026-05-03 17:35 ` bot+bpf-ci @ 2026-05-06 13:30 ` Vernon Yang 0 siblings, 0 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-06 13:30 UTC (permalink / raw) To: bot+bpf-ci Cc: akpm, david, ljs, roman.gushchin, inwardvessel, shakeel.butt, ast, daniel, surenb, linux-kernel, linux-mm, bpf, baohua, lance.yang, dev.jain, yanglincheng, andrii, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai On Mon, May 4, 2026 at 1:35 AM <bot+bpf-ci@kernel.org> wrote: > > > diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore > > index 0002cd359fb1..2a73581876b4 100644 > > --- a/samples/bpf/.gitignore > > +++ b/samples/bpf/.gitignore > > [ ... ] > > > diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile > > index 95a4fa1f1e44..357c7d1c45ef 100644 > > --- a/samples/bpf/Makefile > > +++ b/samples/bpf/Makefile > > [ ... ] > > > diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c > > new file mode 100644 > > index 000000000000..bbee3e9f679c > > [ ... ] > > > diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c > > new file mode 100644 > > index 000000000000..0e064bad136f > > --- /dev/null > > +++ b/samples/bpf/mthp_ext.c > > @@ -0,0 +1,340 @@ > > +// SPDX-License-Identifier: GPL-2.0 > > [ ... ] > > > +int main(int argc, char **argv) > > +{ > > + const char *root_path = DEFAULT_ROOT; > > + unsigned long threshold = DEFAULT_THRESHOLD_MS; > > + unsigned long interval = DEFAULT_INTERVAL_MS; > > + unsigned int init_order = DEFAULT_ORDER; > > + unsigned int min_mem = DEFAULT_MIN_MEM; > > + bool fixed = false; > > + bool debug = false; > > + struct mthp_ext *skel; > > + struct bpf_link *iter_link; > > + struct bpf_link *ops_link; > > + struct ring_buffer *rb; > > + int root_fd; > > + int psi_fd; > > + int err = 0; > > + int opt; > > [ ... 
] > > > + /* Attach struct_ops to root cgroup for mthp_choose */ > > + DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts); > > + opts.flags = BPF_F_CGROUP_FD; > > + opts.target_fd = root_fd; > > + ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts); > ^^^^^^^^^^^^^^^^^^^^^^^^^^ > Will mthp_ext compile with these symbols? A tree-wide grep shows that > bpf_struct_ops_opts, BPF_F_CGROUP_FD, and bpf_map__attach_struct_ops_opts() > do not exist anywhere in the kernel source tree: > > $ git grep -n "struct bpf_struct_ops_opts" > samples/bpf/mthp_ext.c:260: DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts); > > $ git grep -n "BPF_F_CGROUP_FD" > samples/bpf/mthp_ext.c:261: opts.flags = BPF_F_CGROUP_FD; > > $ git grep -n "bpf_map__attach_struct_ops_opts" > samples/bpf/mthp_ext.c:263: ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts); > > The struct bpf_struct_ops_opts type is not defined in tools/lib/bpf/libbpf.h. > The BPF_F_CGROUP_FD flag does not appear in include/uapi/linux/bpf.h or any > other header. The bpf_map__attach_struct_ops_opts() function is not declared > in libbpf.h or exported in tools/lib/bpf/libbpf.map. > > The only existing libbpf API for struct_ops attachment is: > > tools/lib/bpf/libbpf.h: > LIBBPF_API struct bpf_link * > bpf_map__attach_struct_ops(const struct bpf_map *map); > > which takes no opts parameter and cannot attach to a specific cgroup fd. > > Samples/bpf/Makefile was modified earlier in this commit to add mthp_ext to > tprogs-y, meaning the build system will attempt to compile mthp_ext.c during > make samples/bpf/. That compilation will fail with undefined type, undefined > identifier, and implicit declaration errors. > > This looks like it depends on an out-of-tree libbpf extension that adds a > per-cgroup struct_ops attachment API. Is that libbpf patch supposed to be > included earlier in this series, or should the sample use a different attach > mechanism? 
This series is based on Linux v7.1-rc1 (26fd6bff2c05) plus the first four patches of "mm: BPF OOM"[3]. All of the above definitions are in "mm: BPF OOM" PATCH #2: https://lore.kernel.org/linux-mm/20260127024421.494929-3-roman.gushchin@linux.dev/ > > + err = libbpf_get_error(ops_link); > > + if (err) { > > + fprintf(stderr, "ERROR: attach struct_ops failed: %d\n", err); > > + ops_link = NULL; > > + goto attach_opts_fail; > > + } > > [ ... ] > > > diff --git a/samples/bpf/mthp_ext.h b/samples/bpf/mthp_ext.h > > new file mode 100644 > > index 000000000000..33dc01bcebd3 > > [ ... ] > > > --- > AI reviewed your patch. Please fix the bug or email reply why it's not a bug. > See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md > > CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25285279315 ^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20260503175737.6190AC2BCB4@smtp.kernel.org>]
* Re: [PATCH 4/4] samples: bpf: add mthp_ext [not found] ` <20260503175737.6190AC2BCB4@smtp.kernel.org> @ 2026-05-06 13:50 ` Vernon Yang 0 siblings, 0 replies; 14+ messages in thread From: Vernon Yang @ 2026-05-06 13:50 UTC (permalink / raw) To: sashiko Cc: bpf, Andrew Morton, David Hildenbrand (Red Hat), Lorenzo Stoakes (Oracle), Roman Gushchin, inwardvessel, Shakeel Butt, ast, Daniel Borkmann, surenb, linux-kernel, linux-mm, baohua, Lance Yang, Dev Jain Forwarding sashiko review On Mon, May 4, 2026 at 1:57 AM <sashiko-bot@kernel.org> wrote: > > Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider: > - [High] Infinite loop in `trigger_scan()` when `read()` returns an error > - [Medium] Integer overflow in `FROM_MB()` macro disables fallback logic for values >= 4096 > - [Medium] Spurious mTHP fallback during initial cgroup scan due to zero-initialized baseline > - [Medium] Failure in `setup_psi_trigger()` masks the error code and exits successfully > -- > > commit d30597e7a4d6ca7d5532535074e0c66c2c92f486 > Author: Vernon Yang <yanglincheng@kylinos.cn> > > samples: bpf: add mthp_ext > > This commit adds a new BPF sample program, mthp_ext, to adjust mTHP sizes > based on cgroup memory pressure and usage. It monitors the cgroup directory > and automatically falls back to 4KB sizes during high memory pressure or > low usage. > > > diff --git a/samples/bpf/mthp_ext.h b/samples/bpf/mthp_ext.h > > new file mode 100644 > > index 000000000000..33dc01bcebd3 > > --- /dev/null > > +++ b/samples/bpf/mthp_ext.h > > @@ -0,0 +1,30 @@ > [ ... ] > > +#define min(a, b) ((a) < (b) ? a : b) > > +#define FROM_MB(s) (s * 1024 * 1024) > > Does this macro suffer from a 32-bit integer overflow if s is 4096 or > larger? > > If s is an unsigned int like configs.min_mem, 4096 * 1048576 would wrap to 0, > which could silently disable the fallback logic. Good catch, I will fix that in the next version. Thanks! 
> > +#define TO_MB(s) (s / 1024 / 1024) > > + > > +struct config_local { > > + unsigned long threshold; > > + unsigned long interval; > > + unsigned int init_order; > > + unsigned int min_mem; > [ ... ] > > diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c > > new file mode 100644 > > index 000000000000..bbee3e9f679c > > --- /dev/null > > +++ b/samples/bpf/mthp_ext.bpf.c > [ ... ] > > +SEC("iter.s/cgroup") > > +int cgroup_scan(struct bpf_iter__cgroup *ctx) > > +{ > > + struct cgroup *cgrp = ctx->cgroup; > > + struct mem_cgroup *memcg; > > + struct mem_info *info; > > + struct alert_event *e; > > + unsigned long curr_stall; > > + unsigned long curr_mem; > > + unsigned long delta; > > + > > + if (!cgrp) > > + return 1; > > + > > + if (!cgroup_has_tasks(cgrp)) > > + return 0; > > + > > + info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0, > > + BPF_LOCAL_STORAGE_GET_F_CREATE); > > + if (!info) > > + return 0; > > + > > + memcg = bpf_get_mem_cgroup(&cgrp->self); > > + if (!memcg) > > + return 0; > > + > > + bpf_cgroup_flush_stats(cgrp); > > + curr_stall = bpf_cgroup_stall(cgrp, PSI_MEM_FULL); > > + delta = curr_stall - info->stall; > > During the initial scan of a cgroup, info->stall will be zero-initialized. > Since curr_stall represents cumulative stall time, won't delta be a very > large number? > > This seems like it could incorrectly trigger the 4KB fallback during the > first interval regardless of current memory pressure. Good catch, this is indeed a bug. I will fix that in the next version. Thanks! > Copy from cover-letter: > When the redis is under no memory pressure, RPS drops by 10.3% > (from 1.4M to 1.2M, compared to always). After the fix, when there is no memory pressure on Redis, RPS stays at 1.4M, which means mthp_ext is a positive optimization in all scenarios.
> > + bpf_mem_cgroup_flush_stats(memcg); > > + curr_mem = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED) + > > + bpf_mem_cgroup_page_state(memcg, NR_SHMEM); > > + if (curr_mem < FROM_MB(configs.min_mem) || delta >= configs.threshold) > > + info->order = 0; > > + else > > + info->order = PMD_ORDER; > [ ... ] > > diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c > > new file mode 100644 > > index 000000000000..0e064bad136f > > --- /dev/null > > +++ b/samples/bpf/mthp_ext.c > [ ... ] > > +static int trigger_scan(struct bpf_link *iter_link) > > +{ > > + char buf[256]; > > + int fd; > > + > > + fd = bpf_iter_create(bpf_link__fd(iter_link)); > > + if (fd < 0) { > > + fprintf(stderr, "ERROR: bpf_iter_create failed: %s\n", > > + strerror(errno)); > > + return -1; > > + } > > + > > + /* Read to trigger the iter program execution */ > > + while (read(fd, buf, sizeof(buf))) > > + ; > > If read() returns an error (like -1 for EINTR), will this result in an infinite > busy-wait loop since -1 evaluates to true in C? > > Could this be updated to check for > 0 instead? Yes, I will fix that in the next version. Thanks! > > + > > + close(fd); > > + return 0; > > +} > [ ... ] > > +int main(int argc, char **argv) > > +{ > [ ... ] > > + psi_fd = setup_psi_trigger(root_path, "some", threshold * 1000, > > + interval * 1000); > > + if (psi_fd < 0) { > > + fprintf(stderr, "ERROR: PSI trigger setup failed\n"); > > + goto psi_setup_fail; > > + } > > + > > + monitor_thread(psi_fd, &skel->bss->configs, iter_link, rb); > > + > > + close(psi_fd); > > +psi_setup_fail: > > + ring_buffer__free(rb); > > +rb_fail: > > + bpf_link__destroy(iter_link); > > +exit_fixed: > > +attach_iter_fail: > > + bpf_link__destroy(ops_link); > > +attach_opts_fail: > > +load_skel_fail: > > + mthp_ext__destroy(skel); > > +open_skel_fail: > > + close(root_fd); > > + > > + printf("\nExiting...\n"); > > + > > + return err; > > +} > > If setup_psi_trigger() fails, the code jumps to psi_setup_fail. 
However, err > was previously set to 0 from a successful BPF operation. > > Does this cause the program to log an error but exit with a success status > code? Yes, I will fix that in the next version. Thanks! > -- > Sashiko AI review · https://sashiko.dev/#/patchset/20260503165024.1526680-1-vernon2gm@gmail.com?part=4 ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2026-05-06 13:50 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-05-03 16:50 [PATCH 0/4] mm: introduce mthp_ext via cgroup-bpf to make mTHP more transparent Vernon Yang
2026-05-03 16:50 ` [PATCH 1/4] psi: add psi_group_flush_stats() function Vernon Yang
2026-05-03 16:50 ` [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function Vernon Yang
2026-05-03 17:23 ` bot+bpf-ci
2026-05-06 12:38 ` Vernon Yang
[not found] ` <20260503172520.376AAC2BCB4@smtp.kernel.org>
2026-05-06 12:55 ` Vernon Yang
2026-05-03 16:50 ` [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops Vernon Yang
2026-05-03 17:35 ` bot+bpf-ci
2026-05-06 13:06 ` Vernon Yang
[not found] ` <20260503174125.2C949C2BCB4@smtp.kernel.org>
2026-05-06 13:26 ` Vernon Yang
2026-05-03 16:50 ` [PATCH 4/4] samples: bpf: add mthp_ext Vernon Yang
2026-05-03 17:35 ` bot+bpf-ci
2026-05-06 13:30 ` Vernon Yang
[not found] ` <20260503175737.6190AC2BCB4@smtp.kernel.org>
2026-05-06 13:50 ` Vernon Yang
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox