* [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
@ 2025-08-26  7:19 Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 01/10] mm: thp: add support for " Yafang Shao
                   ` (11 more replies)
  0 siblings, 12 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Background
==========

Our production servers consistently configure THP to "never" due to
historical incidents caused by its behavior. Key issues include:
- Increased Memory Consumption
  THP significantly raises overall memory usage, reducing available memory
  for workloads.

- Latency Spikes
  Random latency spikes occur due to frequent memory compaction triggered
  by THP.

- Lack of Fine-Grained Control
  THP tuning is globally configured, making it unsuitable for containerized
  environments. When multiple workloads share a host, enabling THP without
  per-workload control leads to unpredictable behavior.

Because of these issues, administrators will not switch to the madvise or
always modes unless per-workload THP control is available.

To address this, we propose a BPF-based THP policy mechanism for flexible,
per-workload adjustment. Additionally, as David mentioned [0], it can also
serve as a policy prototyping tool (test policies via BPF before
upstreaming them).

Proposed Solution
=================

As suggested by David [0], we introduce a new BPF interface:

/**
 * @get_suggested_order: Get the suggested THP orders for allocation
 * @mm: mm_struct associated with the THP allocation
 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
 *                 When NULL, the decision should be based on @mm (i.e., when
 *                 triggered from an mm-scope hook rather than a VMA-specific
 *                 context).
 *                 Must belong to @mm (guaranteed by the caller).
 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
 * @orders: Bitmask of requested THP orders for this allocation
 *          - PMD-mapped allocation if PMD_ORDER is set
 *          - mTHP allocation otherwise
 *
 * Return: Bitmask of suggested THP orders for allocation. The highest
 *         suggested order will not exceed the highest requested order
 *         in @orders.
 */
 int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
                            u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;

This interface:
- Supports both use cases (per-workload tuning + policy prototyping).
- Can be extended with BPF helpers (e.g., for memory pressure awareness).
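
As an illustration, a minimal policy might look like the sketch below
(modeled on the selftest in patch #5; the program and map names are made
up for the example, and the usual vmlinux.h/bpf_tracing.h includes are
assumed). It suggests no orders in the page fault path and leaves the
requested orders untouched elsewhere:

SEC("struct_ops/get_suggested_order")
int BPF_PROG(sketch_policy, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
	     u64 vma_flags, enum tva_type tva_flags, int orders)
{
	/* Returning 0 suggests no THP orders for this allocation. */
	if (tva_flags == TVA_PAGEFAULT)
		return 0;
	/* Otherwise, keep whatever orders the caller requested. */
	return orders;
}

SEC(".struct_ops.link")
struct bpf_thp_ops sketch_policy_ops = {
	.get_suggested_order = (void *)sketch_policy,
};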

This is an experimental feature. To use it, you must enable
CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.

Warning:
- The interface may change
- Behavior may differ in future kernel versions
- We might remove it in the future


Selftests
=========

BPF selftests
-------------

Patch #5: Implements a basic BPF THP policy that restricts THP allocation
          via khugepaged to tasks within a specified memory cgroup.
Patch #6: Contains test cases validating the khugepaged fork behavior.
Patch #7: Provides tests for dynamic BPF program updates and replacement.
Patch #8: Includes negative tests for invalid BPF helper usage, verifying
          proper verification by the BPF verifier.

Currently, several dependency patches reside in mm-new but haven't been
merged into bpf-next:
  mm: add bitmap mm->flags field
  mm/huge_memory: convert "tva_flags" to "enum tva_type"
  mm: convert core mm to mm_flags_*() accessors

To enable BPF CI testing, these dependencies were manually applied to
bpf-next [1]. All selftests in this series pass successfully. The observed
CI failures are unrelated to these changes.

Performance Evaluation
----------------------

As suggested by Usama [2], performance impact was measured given the page
fault handler modifications. The standard `perf bench mem memset` benchmark
was employed to assess page fault performance.

Testing was conducted on an AMD EPYC 7W83 64-Core Processor (single NUMA
node). Due to variance between individual test runs, a script executed
10000 iterations to calculate meaningful averages and standard deviations.

The results across three configurations show negligible performance impact:
- Baseline (without this patch series)
- With patch series but no BPF program attached
- With patch series and BPF program attached

The results are as follows:

  Number of runs: 10,000
  Average throughput: 40-41 GB/sec
  Standard deviation: 7-8 GB/sec

Production verification
-----------------------

We have successfully deployed a variant of this approach across numerous
Kubernetes production servers. The implementation enables THP for specific
workloads (such as applications utilizing ZGC [3]) while disabling it for
others. This selective deployment has operated flawlessly, with no
regression reports to date.

For ZGC-based applications, our verification demonstrates that shmem THP
delivers significant improvements:
- Reduced CPU utilization
- Lower average latencies

Future work
===========

Based on our validation with production workloads, we observed mixed
results with XFS large folios (also known as File THP):

- Performance Benefits
  Some workloads demonstrated significant improvements with XFS large
  folios enabled
- Performance Regression
  Some workloads experienced degradation when using XFS large folios

These results demonstrate that File THP, similar to anonymous THP, requires
a more granular approach instead of a uniform implementation.

We will extend the BPF-based order selection mechanism to support File THP
allocation policies.

Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0] 
Link: https://github.com/kernel-patches/bpf/pull/9561 [1]
Link: https://lwn.net/ml/all/a24d632d-4b11-4c88-9ed0-26fa12a0fce4@gmail.com/ [2]
Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTransparentHugePagesOnLinux [3]

Changes:
========

RFC v5->v6:
- Code improvement around the RCU usage (Usama)
- Add selftests for khugepaged fork (Usama)
- Add performance data for page fault (Usama)
- Remove the RFC tag

RFC v4->v5: https://lwn.net/Articles/1034265/
- Add support for vma (David)
- Add mTHP support in khugepaged (Zi)
- Use bitmask of all allowed orders instead (Zi)
- Retrieve the page size and PMD order rather than hardcoding them (Zi)

RFC v3->v4: https://lwn.net/Articles/1031829/
- Use a new interface get_suggested_order() (David)
- Mark it as experimental (David, Lorenzo)
- Code improvement in THP (Usama)
- Code improvement in BPF struct ops (Amery)

RFC v2->v3: https://lwn.net/Articles/1024545/
- Finer-grained tuning based on madvise or always mode (David, Lorenzo)
- Use BPF to write more advanced policies logic (David, Lorenzo)

RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows:
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (10):
  mm: thp: add support for BPF based THP order selection
  mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  mm: thp: add a new kfunc bpf_mm_get_task()
  bpf: mark vma->vm_mm as trusted
  selftests/bpf: add a simple BPF based THP policy
  selftests/bpf: add test case for khugepaged fork
  selftests/bpf: add test case to update thp policy
  selftests/bpf: add test cases for invalid thp_adjust usage
  Documentation: add BPF-based THP adjustment documentation
  MAINTAINERS: add entry for BPF-based THP adjustment

 Documentation/admin-guide/mm/transhuge.rst    |  47 +++
 MAINTAINERS                                   |  10 +
 include/linux/huge_mm.h                       |  15 +
 include/linux/khugepaged.h                    |  12 +-
 kernel/bpf/verifier.c                         |   5 +
 mm/Kconfig                                    |  12 +
 mm/Makefile                                   |   1 +
 mm/bpf_thp.c                                  | 269 ++++++++++++++
 mm/huge_memory.c                              |  10 +
 mm/khugepaged.c                               |  26 +-
 mm/memory.c                                   |  18 +-
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 343 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     | 115 ++++++
 .../bpf/progs/test_thp_adjust_trusted_vma.c   |  27 ++
 .../progs/test_thp_adjust_unreleased_memcg.c  |  24 ++
 .../progs/test_thp_adjust_unreleased_task.c   |  25 ++
 17 files changed, 955 insertions(+), 7 deletions(-)
 create mode 100644 mm/bpf_thp.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_task.c

-- 
2.47.3




* [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-27  2:57   ` kernel test robot
                     ` (2 more replies)
  2025-08-26  7:19 ` [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
                   ` (10 subsequent siblings)
  11 siblings, 3 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, or other
  paths.
- System memory pressure
  (May require new BPF helpers to accurately assess memory pressure.)

Key Details:
- Only one BPF program can be attached at a time, but it can be updated
  dynamically to adjust the policy.
- Supports automatic mTHP order selection and per-workload THP policies.
- Only functional when THP is set to madvise or always.

It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to be enabled. [1]
This feature is unstable and may evolve in future kernel versions.
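
From userspace, the program is attached as a struct_ops map; a second
attach attempt fails with -EBUSY, while the attached program can be
replaced atomically. A libbpf sketch (the skeleton and map names here are
illustrative; the selftests later in this series show the real usage):

	struct bpf_link *link;
	int err;

	/* Attach the initial policy; only one can be active at a time. */
	link = bpf_map__attach_struct_ops(skel->maps.thp_ops);

	/* Swap in an updated policy without a detach window. */
	err = bpf_link__update_map(link, skel->maps.new_thp_ops);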

Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/huge_mm.h    |  15 +++
 include/linux/khugepaged.h |  12 ++-
 mm/Kconfig                 |  12 +++
 mm/Makefile                |   1 +
 mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c           |  10 ++
 mm/khugepaged.c            |  26 +++++-
 mm/memory.c                |  18 +++-
 8 files changed, 273 insertions(+), 7 deletions(-)
 create mode 100644 mm/bpf_thp.c

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1ac0d06fb3c1..f0c91d7bd267 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,8 @@
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 #include <linux/kobject.h>
+#include <linux/pgtable.h>
+#include <linux/mm.h>
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
 };
 
 struct kobject;
@@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
 			(1<<TRANSPARENT_HUGEPAGE_FLAG);
 }
 
+#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
+int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+			u64 vma_flags, enum tva_type tva_flags, int orders);
+#else
+static inline int
+get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+		    u64 vma_flags, enum tva_type tva_flags, int orders)
+{
+	return orders;
+}
+#endif
+
 static inline int highest_order(unsigned long orders)
 {
 	return fls_long(orders) - 1;
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..d81c1228a21f 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,8 @@
 
 #include <linux/mm.h>
 
+#include <linux/huge_mm.h>
+
 extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern struct attribute_group khugepaged_attr_group;
@@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 
 static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
-	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
+	/*
+	 * THP allocation policy can be dynamically modified via BPF. Even if a
+	 * task was allowed to allocate THPs, BPF can decide whether its forked
+	 * child can allocate THPs.
+	 *
+	 * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
+	 */
+	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
+		get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
 		__khugepaged_enter(mm);
 }
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 4108bcd96784..d10089e3f181 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
 
 	  EXPERIMENTAL because the impact of some changes is still unclear.
 
+config EXPERIMENTAL_BPF_ORDER_SELECTION
+	bool "BPF-based THP order selection (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
+
+	help
+	  Enable dynamic THP order selection using BPF programs. This
+	  experimental feature allows custom BPF logic to determine optimal
+	  transparent hugepage allocation sizes at runtime.
+
+	  Warning: This feature is unstable and may change in future kernel
+	  versions.
+
 endif # TRANSPARENT_HUGEPAGE
 
 # simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index ef54aa615d9d..cb55d1509be1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
new file mode 100644
index 000000000000..fbff3b1bb988
--- /dev/null
+++ b/mm/bpf_thp.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+struct bpf_thp_ops {
+	/**
+	 * @get_suggested_order: Get the suggested THP orders for allocation
+	 * @mm: mm_struct associated with the THP allocation
+	 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
+	 *                 When NULL, the decision should be based on @mm (i.e., when
+	 *                 triggered from an mm-scope hook rather than a VMA-specific
+	 *                 context).
+	 *                 Must belong to @mm (guaranteed by the caller).
+	 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
+	 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
+	 * @orders: Bitmask of requested THP orders for this allocation
+	 *          - PMD-mapped allocation if PMD_ORDER is set
+	 *          - mTHP allocation otherwise
+	 *
+	 * Return: Bitmask of suggested THP orders for allocation. The highest
+	 *         suggested order will not exceed the highest requested order
+	 *         in @orders.
+	 */
+	int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+				   u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
+};
+
+static struct bpf_thp_ops bpf_thp;
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+			u64 vma_flags, enum tva_type tva_flags, int orders)
+{
+	int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+				   u64 vma_flags, enum tva_type tva_flags, int orders);
+	int suggested_orders = orders;
+
+	/* No BPF program is attached */
+	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+		      &transparent_hugepage_flags))
+		return suggested_orders;
+
+	rcu_read_lock();
+	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
+	if (!bpf_suggested_order)
+		goto out;
+
+	suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
+	if (highest_order(suggested_orders) > highest_order(orders))
+		suggested_orders = orders;
+
+out:
+	rcu_read_unlock();
+	return suggested_orders;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+	.get_func_proto = bpf_thp_get_func_proto,
+	.is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+			       const struct btf_member *member,
+			       void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	spin_lock(&thp_ops_lock);
+	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+			     &transparent_hugepage_flags)) {
+		spin_unlock(&thp_ops_lock);
+		return -EBUSY;
+	}
+	WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
+	rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order);
+	spin_unlock(&thp_ops_lock);
+	return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+	spin_lock(&thp_ops_lock);
+	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+	WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
+	rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
+	spin_unlock(&thp_ops_lock);
+
+	synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *ops = kdata;
+	struct bpf_thp_ops *old = old_kdata;
+	int ret = 0;
+
+	if (!ops || !old)
+		return -EINVAL;
+
+	spin_lock(&thp_ops_lock);
+	/* The prog has already been removed. */
+	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
+		ret = -ENOENT;
+		goto out;
+	}
+	WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
+	rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
+			    lockdep_is_held(&thp_ops_lock));
+
+out:
+	spin_unlock(&thp_ops_lock);
+	if (!ret)
+		synchronize_rcu();
+	return ret;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	if (!ops->get_suggested_order) {
+		pr_err("bpf_thp: required ops isn't implemented\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+			   u64 vma_flags, enum tva_type vm_flags, int orders)
+{
+	return orders;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+	.get_suggested_order = suggested_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+	.verifier_ops = &thp_bpf_verifier_ops,
+	.init = bpf_thp_init,
+	.init_member = bpf_thp_init_member,
+	.reg = bpf_thp_reg,
+	.unreg = bpf_thp_unreg,
+	.update = bpf_thp_update,
+	.validate = bpf_thp_validate,
+	.cfi_stubs = &__bpf_thp_ops,
+	.owner = THIS_MODULE,
+	.name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+
+	if (err)
+		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+	return err;
+}
+late_initcall(bpf_thp_ops_init);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d89992b65acc..bd8f8f34ab3c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		return ret;
 	khugepaged_enter_vma(vma, vma->vm_flags);
 
+	/*
+	 * This check must occur after khugepaged_enter_vma() because the
+	 * BPF policy may permit THP allocation via khugepaged while
+	 * simultaneously disallowing THP allocation during page fault
+	 * handling.
+	 */
+	if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
+				BIT(PMD_ORDER))
+		return VM_FAULT_FALLBACK;
+
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d3d4f116e14b..935583626db6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) &&
+		    get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED,
+					BIT(PMD_ORDER)))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 		return SCAN_ADDRESS_RANGE;
 	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
 		return SCAN_VMA_CHECK;
+	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER)))
+		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
 	 * remapped to file after khugepaged reaquired the mmap_lock.
@@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 		/* khugepaged_mm_lock actually not necessary for the below */
 		mm_slot_free(mm_slot_cache, mm_slot);
 		mmdrop(mm);
+	} else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) {
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+		mm_flags_clear(MMF_VM_HUGEPAGE, mm);
+		mm_slot_free(mm_slot_cache, mm_slot);
 	}
 }
 
@@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 
+	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
+				 BIT(PMD_ORDER)))
+		return SCAN_VMA_CHECK;
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
 	if (userfaultfd_wp(vma))
 		return SCAN_PTE_UFFD_WP;
@@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * the next mm on the list.
 	 */
 	vma = NULL;
+
+	/* If this mm is not suitable for the scan list, we should remove it. */
+	if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
+		goto breakouterloop_mmap_lock;
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
 
@@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) ||
+		    !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED,
+					 BIT(PMD_ORDER))) {
 skip:
 			progress++;
 			continue;
@@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return -EINVAL;
 
+	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
+				 BIT(PMD_ORDER)))
+		return -EINVAL;
+
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
 	if (!cc)
 		return -ENOMEM;
diff --git a/mm/memory.c b/mm/memory.c
index d9de6c056179..0178857aa058 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
 static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	int order, suggested_orders;
 	unsigned long orders;
 	struct folio *folio;
 	unsigned long addr;
@@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	spinlock_t *ptl;
 	pte_t *pte;
 	gfp_t gfp;
-	int order;
 
 	/*
 	 * If uffd is active for the vma we need per-page fault fidelity to
@@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	if (!zswap_never_enabled())
 		goto fallback;
 
+	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
+					       TVA_PAGEFAULT,
+					       BIT(PMD_ORDER) - 1);
+	if (!suggested_orders)
+		goto fallback;
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	/*
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
-					  BIT(PMD_ORDER) - 1);
+					  suggested_orders);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
 					  vmf->address, orders);
@@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	int order, suggested_orders;
 	unsigned long orders;
 	struct folio *folio;
 	unsigned long addr;
 	pte_t *pte;
 	gfp_t gfp;
-	int order;
 
 	/*
 	 * If uffd is active for the vma we need per-page fault fidelity to
@@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	if (unlikely(userfaultfd_armed(vma)))
 		goto fallback;
 
+	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
+					       TVA_PAGEFAULT,
+					       BIT(PMD_ORDER) - 1);
+	if (!suggested_orders)
+		goto fallback;
 	/*
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * for this vma. Then filter out the orders that can't be allocated over
 	 * the faulting address and still be fully contained in the vma.
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
-					  BIT(PMD_ORDER) - 1);
+					  suggested_orders);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 
 	if (!orders)
-- 
2.47.3




* [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 01/10] mm: thp: add support for " Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-27 15:34   ` Lorenzo Stoakes
  2025-08-27 20:45   ` Shakeel Butt
  2025-08-26  7:19 ` [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Add a new kfunc, bpf_mm_get_mem_cgroup(), to retrieve the mem_cgroup
associated with the given @mm. The obtained mem_cgroup must be released
by calling bpf_put_mem_cgroup() as a paired operation.
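
For example, a get_suggested_order() program could use the pair like this
(a sketch; the actual policy logic is left out):

	struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);

	if (!memcg)
		return 0;	/* NULL, e.g. when CONFIG_MEMCG is disabled */
	/* ... base the order decision on the memcg ... */
	bpf_put_mem_cgroup(memcg);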

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/bpf_thp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
index fbff3b1bb988..b757e8f425fd 100644
--- a/mm/bpf_thp.c
+++ b/mm/bpf_thp.c
@@ -175,10 +175,59 @@ static struct bpf_struct_ops bpf_bpf_thp_ops = {
 	.name = "bpf_thp_ops",
 };
 
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
+ * @mm: The mm_struct to query
+ *
+ * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
+ *
+ * Return: The associated mem_cgroup on success, or NULL on failure. Note that
+ * this function depends on CONFIG_MEMCG being enabled - it will always return
+ * NULL if CONFIG_MEMCG is not configured.
+ */
+__bpf_kfunc struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm)
+{
+	return get_mem_cgroup_from_mm(mm);
+}
+
+/**
+ * bpf_put_mem_cgroup - Release a memory cgroup obtained from bpf_mm_get_mem_cgroup()
+ * @memcg: The memory cgroup to release
+ */
+__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!memcg)
+		return;
+	css_put(&memcg->css);
+#endif
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_thp_ids)
+BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+BTF_KFUNCS_END(bpf_thp_ids)
+
+static const struct btf_kfunc_id_set bpf_thp_set = {
+	.owner = THIS_MODULE,
+	.set = &bpf_thp_ids,
+};
+
 static int __init bpf_thp_ops_init(void)
 {
-	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+	int err;
+
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_thp_set);
+	if (err) {
+		pr_err("bpf_thp: Failed to register kfunc sets (%d)\n", err);
+		return err;
+	}
 
+	err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
 	if (err)
 		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
 	return err;
-- 
2.47.3




* [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 01/10] mm: thp: add support for " Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-27 15:42   ` Lorenzo Stoakes
  2025-08-26  7:19 ` [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted Yafang Shao
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Add a new kfunc, bpf_mm_get_task(), to retrieve the task_struct
associated with the given @mm. The obtained task_struct must be released
by calling bpf_task_release() as a paired operation.
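
For example (a sketch; the actual policy logic is left out):

	struct task_struct *p = bpf_mm_get_task(mm);

	if (!p)
		return 0;
	/* ... inspect the task, e.g. p->pid or p->parent ... */
	bpf_task_release(p);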

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
index b757e8f425fd..46b3bc96359e 100644
--- a/mm/bpf_thp.c
+++ b/mm/bpf_thp.c
@@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
 #endif
 }
 
+/**
+ * bpf_mm_get_task - Get the task struct associated with a mm_struct.
+ * @mm: The mm_struct to query
+ *
+ * The obtained task_struct must be released by calling bpf_task_release().
+ *
+ * Return: The associated task_struct on success, or NULL on failure. Note that
+ * this function depends on CONFIG_MEMCG being enabled - it will always return
+ * NULL if CONFIG_MEMCG is not configured.
+ */
+__bpf_kfunc struct task_struct *bpf_mm_get_task(struct mm_struct *mm)
+{
+#ifdef CONFIG_MEMCG
+	struct task_struct *task;
+
+	if (!mm)
+		return NULL;
+	rcu_read_lock();
+	task = rcu_dereference(mm->owner);
+	if (!task)
+		goto out;
+	if (!refcount_inc_not_zero(&task->rcu_users))
+		goto out;
+
+	rcu_read_unlock();
+	return task;
+
+out:
+	rcu_read_unlock();
+#endif
+	return NULL;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_thp_ids)
 BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_mm_get_task, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
 BTF_KFUNCS_END(bpf_thp_ids)
 
 static const struct btf_kfunc_id_set bpf_thp_set = {
-- 
2.47.3




* [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (2 preceding siblings ...)
  2025-08-26  7:19 ` [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-27 15:45   ` Lorenzo Stoakes
  2025-08-26  7:19 ` [PATCH v6 mm-new 05/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Every VMA has an associated mm_struct, which is safe to access outside
of an RCU read-side critical section. Thus, we can mark vma->vm_mm as
trusted. With this change, BPF helpers can safely access vma->vm_mm to
retrieve the associated task from the VMA.
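
For example, the following sketch is accepted by the verifier, provided
the NULL check on @vma__nullable is kept (dropping the check is covered
by a negative test later in this series):

	if (vma__nullable) {
		/* vma->vm_mm is trusted, so it may be passed to KF_TRUSTED_ARGS kfuncs. */
		struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(vma__nullable->vm_mm);

		if (memcg)
			bpf_put_mem_cgroup(memcg);
	}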

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/verifier.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c4f69a9e9af6..984ffbca5cbe 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7154,6 +7154,10 @@ BTF_TYPE_SAFE_TRUSTED(struct file) {
 	struct inode *f_inode;
 };
 
+BTF_TYPE_SAFE_TRUSTED(struct vm_area_struct) {
+	struct mm_struct *vm_mm;
+};
+
 BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry) {
 	struct inode *d_inode;
 };
@@ -7193,6 +7197,7 @@ static bool type_is_trusted(struct bpf_verifier_env *env,
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct linux_binprm));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct file));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct vm_area_struct));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id, "__safe_trusted");
 }
-- 
2.47.3




* [PATCH v6 mm-new 05/10] selftests/bpf: add a simple BPF based THP policy
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (3 preceding siblings ...)
  2025-08-26  7:19 ` [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 06/10] selftests/bpf: add test case for khugepaged fork Yafang Shao
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

This selftest verifies that PMD-mapped THP allocation is restricted
during page faults for tasks within a specific cgroup, while still being
permitted via khugepaged.

Since THP allocation depends on various factors (e.g., system memory
pressure), using the actual allocated THP size for validation is
unreliable. Instead, we check the return value of get_suggested_order(),
which indicates whether the system intends to allocate a THP, regardless of
whether the allocation ultimately succeeds.

This test case defines a simple THP policy. The policy permits
PMD-mapped THP allocation through khugepaged for tasks in a designated
cgroup, but prohibits it for all other tasks and contexts, including the
page fault handler. However, khugepaged might not run immediately during
this test, making its count metrics unreliable.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 254 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  76 ++++++
 3 files changed, 333 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 8916ab814a3e..27f0249c7600 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -26,6 +26,7 @@ CONFIG_DMABUF_HEAPS=y
 CONFIG_DMABUF_HEAPS_SYSTEM=y
 CONFIG_DUMMY=y
 CONFIG_DYNAMIC_FTRACE=y
+CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION=y
 CONFIG_FPROBE=y
 CONFIG_FTRACE_SYSCALLS=y
 CONFIG_FUNCTION_ERROR_INJECTION=y
@@ -51,6 +52,7 @@ CONFIG_IPV6_TUNNEL=y
 CONFIG_KEYS=y
 CONFIG_LIRC=y
 CONFIG_LWTUNNEL=y
+CONFIG_MEMCG=y
 CONFIG_MODULE_SIG=y
 CONFIG_MODULE_SRCVERSION_ALL=y
 CONFIG_MODULE_UNLOAD=y
@@ -114,6 +116,7 @@ CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
 CONFIG_SYN_COOKIES=y
 CONFIG_TEST_BPF=m
+CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_UDMABUF=y
 CONFIG_USERFAULTFD=y
 CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..a4a34ee28301
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <math.h>
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "test_thp_adjust.skel.h"
+
+#define LEN (16 * 1024 * 1024) /* 16MB */
+#define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
+#define PMD_SIZE_FILE "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
+
+static struct test_thp_adjust *skel;
+static char *thp_addr, old_mode[32];
+static long pagesize;
+
+static int thp_mode_save(void)
+{
+	const char *start, *end;
+	char buf[128];
+	int fd, err;
+	size_t len;
+
+	fd = open(THP_ENABLED_FILE, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	err = read(fd, buf, sizeof(buf) - 1);
+	if (err == -1)
+		goto close;
+
+	start = strchr(buf, '[');
+	end = start ? strchr(start, ']') : NULL;
+	if (!start || !end || end <= start) {
+		err = -1;
+		goto close;
+	}
+
+	len = end - start - 1;
+	if (len >= sizeof(old_mode))
+		len = sizeof(old_mode) - 1;
+	strncpy(old_mode, start + 1, len);
+	old_mode[len] = '\0';
+
+close:
+	close(fd);
+	return err;
+}
+
+static int thp_mode_set(const char *desired_mode)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_RDWR);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, desired_mode, strlen(desired_mode));
+	close(fd);
+	return err;
+}
+
+static int thp_mode_reset(void)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_WRONLY);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, old_mode, strlen(old_mode));
+	close(fd);
+	return err;
+}
+
+static int thp_alloc(void)
+{
+	int err, i;
+
+	thp_addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (thp_addr == MAP_FAILED)
+		return -1;
+
+	err = madvise(thp_addr, LEN, MADV_HUGEPAGE);
+	if (err == -1)
+		goto unmap;
+
+	/* Accessing a single byte within a page is sufficient to trigger a page fault. */
+	for (i = 0; i < LEN; i += pagesize)
+		thp_addr[i] = 1;
+	return 0;
+
+unmap:
+	munmap(thp_addr, LEN);
+	return -1;
+}
+
+static void thp_free(void)
+{
+	if (!thp_addr)
+		return;
+	munmap(thp_addr, LEN);
+}
+
+static int get_pmd_order(void)
+{
+	ssize_t bytes_read, size;
+	int fd, order, ret = -1;
+	char buf[64], *endptr;
+
+	fd = open(PMD_SIZE_FILE, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	bytes_read = read(fd, buf, sizeof(buf) - 1);
+	if (bytes_read <= 0)
+		goto close_fd;
+
+	/* Remove potential newline character */
+	if (buf[bytes_read - 1] == '\n')
+		buf[bytes_read - 1] = '\0';
+
+	size = strtoul(buf, &endptr, 10);
+	if (endptr == buf || *endptr != '\0')
+		goto close_fd;
+	if (size % pagesize != 0)
+		goto close_fd;
+	ret = size / pagesize;
+	if ((ret & (ret - 1)) == 0) {
+		order = 0;
+		while (ret > 1) {
+			ret >>= 1;
+			order++;
+		}
+		ret = order;
+	}
+
+close_fd:
+	close(fd);
+	return ret;
+}
+
+static void subtest_thp_policy(void)
+{
+	struct bpf_link *fentry_link, *ops_link;
+
+	/* After attaching struct_ops, THP will be allocated only in khugepaged. */
+	ops_link = bpf_map__attach_struct_ops(skel->maps.khugepaged_ops);
+	if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+		return;
+
+	/* Create a new BPF program to detect the result. */
+	fentry_link = bpf_program__attach_trace(skel->progs.thp_run);
+	if (!ASSERT_OK_PTR(fentry_link, "attach fentry"))
+		goto detach_ops;
+	if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+		goto detach;
+
+	if (!ASSERT_EQ(skel->bss->pf_alloc, 0, "alloc_in_pf"))
+		goto thp_free;
+	if (!ASSERT_GT(skel->bss->pf_disallow, 0, "disallow_in_pf"))
+		goto thp_free;
+
+	ASSERT_EQ(skel->bss->khugepaged_disallow, 0, "disallow_in_khugepaged");
+thp_free:
+	thp_free();
+detach:
+	bpf_link__destroy(fentry_link);
+detach_ops:
+	bpf_link__destroy(ops_link);
+}
+
+static int thp_adjust_setup(void)
+{
+	int err, cgrp_fd, cgrp_id, pmd_order;
+
+	pagesize = sysconf(_SC_PAGESIZE);
+	pmd_order = get_pmd_order();
+	if (!ASSERT_NEQ(pmd_order, -1, "get_pmd_order"))
+		return -1;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "cgrp_env_setup"))
+		return -1;
+
+	cgrp_fd = create_and_get_cgroup("thp_adjust");
+	if (!ASSERT_GE(cgrp_fd, 0, "create_and_get_cgroup"))
+		goto cleanup;
+	close(cgrp_fd);
+
+	err = join_cgroup("thp_adjust");
+	if (!ASSERT_OK(err, "join_cgroup"))
+		goto remove_cgrp;
+
+	err = -1;
+	cgrp_id = get_cgroup_id("thp_adjust");
+	if (!ASSERT_GE(cgrp_id, 0, "create_and_get_cgroup"))
+		goto join_root;
+
+	if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+		goto join_root;
+	if (!ASSERT_GE(thp_mode_set("madvise"), 0, "THP mode set"))
+		goto join_root;
+
+	skel = test_thp_adjust__open();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		goto thp_reset;
+
+	skel->bss->cgrp_id = cgrp_id;
+	skel->bss->pmd_order = pmd_order;
+
+	err = test_thp_adjust__load(skel);
+	if (!ASSERT_OK(err, "load"))
+		goto destroy;
+	return 0;
+
+destroy:
+	test_thp_adjust__destroy(skel);
+thp_reset:
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+join_root:
+	/* We must join the root cgroup before removing the created cgroup. */
+	err = join_root_cgroup();
+	ASSERT_OK(err, "join_cgroup to root");
+remove_cgrp:
+	remove_cgroup("thp_adjust");
+cleanup:
+	cleanup_cgroup_environment();
+	return err;
+}
+
+static void thp_adjust_destroy(void)
+{
+	int err;
+
+	test_thp_adjust__destroy(skel);
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+	err = join_root_cgroup();
+	ASSERT_OK(err, "join_cgroup to root");
+	if (!err)
+		remove_cgroup("thp_adjust");
+	cleanup_cgroup_environment();
+}
+
+void test_thp_adjust(void)
+{
+	if (thp_adjust_setup() == -1)
+		return;
+
+	if (test__start_subtest("alloc_in_khugepaged"))
+		subtest_thp_policy();
+
+	thp_adjust_destroy();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..635915f31786
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+int pf_alloc, pf_disallow, khugepaged_disallow;
+struct mm_struct *target_mm;
+int pmd_order, cgrp_id;
+
+/* Detecting whether a task can successfully allocate THP is unreliable because
+ * it may be influenced by system memory pressure. Instead of making the result
+ * dependent on unpredictable factors, we should simply check
+ * get_suggested_order()'s return value, which is deterministic.
+ */
+SEC("fexit/get_suggested_order")
+int BPF_PROG(thp_run, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, u64 tva_flags, int orders, int retval)
+{
+	if (mm != target_mm)
+		return 0;
+
+	if (orders != (1 << pmd_order))
+		return 0;
+
+	if (tva_flags == TVA_PAGEFAULT) {
+		if (retval == (1 << pmd_order))
+			pf_alloc++;
+		else if (!retval)
+			pf_disallow++;
+	} else if (tva_flags == TVA_KHUGEPAGED || tva_flags == -1) {
+		/* khugepaged is not triggered immediately, so its allocation
+		 * counts are unreliable.
+		 */
+		if (!retval)
+			khugepaged_disallow++;
+	}
+	return 0;
+}
+
+SEC("struct_ops/get_suggested_order")
+int BPF_PROG(alloc_in_khugepaged, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, enum tva_type tva_flags, int orders)
+{
+	struct mem_cgroup *memcg;
+	int suggested_orders = 0;
+
+	if (orders != (1 << pmd_order))
+		return 0;
+
+	/* Only works when CONFIG_MEMCG is enabled. */
+	memcg = bpf_mm_get_mem_cgroup(mm);
+	if (!memcg)
+		return 0;
+
+	if (memcg->css.cgroup->kn->id == cgrp_id) {
+		if (!target_mm)
+			target_mm = mm;
+
+		/* BPF THP allocation policy:
+		 * - Allow PMD allocation in khugepagd only
+		 */
+		if (tva_flags == TVA_KHUGEPAGED || tva_flags == -1)
+			suggested_orders = orders;
+	}
+
+	bpf_put_mem_cgroup(memcg);
+	return suggested_orders;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops khugepaged_ops = {
+	.get_suggested_order = (void *)alloc_in_khugepaged,
+};
-- 
2.47.3




* [PATCH v6 mm-new 06/10] selftests/bpf: add test case for khugepaged fork
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (4 preceding siblings ...)
  2025-08-26  7:19 ` [PATCH v6 mm-new 05/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 07/10] selftests/bpf: add test case to update thp policy Yafang Shao
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

In this test case, the parent is allowed to allocate THP, but the child
it forks is not.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/thp_adjust.c     | 59 +++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     | 39 ++++++++++++
 2 files changed, 98 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index a4a34ee28301..bf367c6e6f52 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -1,5 +1,11 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#define _GNU_SOURCE
+#include <sched.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>
+#include <linux/sched.h>
+
 #include <math.h>
 #include <sys/mman.h>
 #include <test_progs.h>
@@ -170,6 +176,57 @@ static void subtest_thp_policy(void)
 	bpf_link__destroy(ops_link);
 }
 
+/*
+ * In this test case, we clone the child process directly into the root cgroup.
+ * Consequently, the child process is not permitted to alloc THP.
+ */
+static void subtest_thp_fork(void)
+{
+	struct clone_args args = {
+		.flags = CLONE_INTO_CGROUP,
+		.exit_signal = SIGCHLD,
+	};
+	struct bpf_link *ops_link;
+	int status, err;
+	pid_t pid;
+
+	skel->bss->ppid = getpid();
+	if (!ASSERT_GT(skel->bss->ppid, 0, "getpid"))
+		return;
+	args.cgroup = get_root_cgroup();
+	if (!ASSERT_GE(args.cgroup, 0, "get_root_cgrp_fd"))
+		return;
+
+	ops_link = bpf_map__attach_struct_ops(skel->maps.thp_fork_ops);
+	if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+		return;
+
+	pid = syscall(__NR_clone3, &args, sizeof(args));
+	if (!ASSERT_GE(pid, 0, "clone3"))
+		goto detach_ops;
+
+	if (pid == 0) {
+		/* child */
+		if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+			exit(EXIT_FAILURE);
+		thp_free();
+		exit(EXIT_SUCCESS);
+	}
+
+	err = waitpid(pid, &status, 0);
+	if (!ASSERT_EQ(err, pid, "waitpid"))
+		goto detach_ops;
+	ASSERT_EQ(skel->bss->fork_fail, 0, "fork_fail");
+	ASSERT_GT(skel->bss->fork_succeed, 0, "fork_succeed");
+
+	if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+		goto detach_ops;
+	thp_free();
+	ASSERT_GT(skel->bss->parent_succeed, 0, "parent_succeed");
+detach_ops:
+	bpf_link__destroy(ops_link);
+}
+
 static int thp_adjust_setup(void)
 {
 	int err, cgrp_fd, cgrp_id, pmd_order;
@@ -249,6 +306,8 @@ void test_thp_adjust(void)
 
 	if (test__start_subtest("alloc_in_khugepaged"))
 		subtest_thp_policy();
+	if (test__start_subtest("khugepaged_fork"))
+		subtest_thp_fork();
 
 	thp_adjust_destroy();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
index 635915f31786..034086ce2f3d 100644
--- a/tools/testing/selftests/bpf/progs/test_thp_adjust.c
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -6,6 +6,7 @@
 
 char _license[] SEC("license") = "GPL";
 
+int ppid, fork_fail, fork_succeed, parent_succeed;
 int pf_alloc, pf_disallow, khugepaged_disallow;
 struct mm_struct *target_mm;
 int pmd_order, cgrp_id;
@@ -74,3 +75,41 @@ SEC(".struct_ops.link")
 struct bpf_thp_ops khugepaged_ops = {
 	.get_suggested_order = (void *)alloc_in_khugepaged,
 };
+
+
+SEC("struct_ops/get_suggested_order")
+int BPF_PROG(thp_fork_test, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, enum tva_type tva_flags, int orders)
+{
+	struct task_struct *p = bpf_get_current_task_btf();
+	struct mem_cgroup *memcg;
+	int suggested_orders = 0;
+
+	/* Only works when CONFIG_MEMCG is enabled. */
+	memcg = bpf_mm_get_mem_cgroup(mm);
+	if (!memcg)
+		return 0;
+
+	/* The tasks under this specific cgroup are allowed to alloc THP */
+	if (memcg->css.cgroup->kn->id == cgrp_id)
+		suggested_orders = orders;
+
+	if (p->parent->pid == ppid) {
+		/* The child is forked into root cgrp, so it can't alloc THP */
+		if (suggested_orders)
+			fork_fail++;
+		else
+			fork_succeed++;
+	} else if (p->pid == ppid) {
+		/* The parent can alloc THP */
+		if (suggested_orders)
+			parent_succeed++;
+	}
+	bpf_put_mem_cgroup(memcg);
+	return suggested_orders;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp_fork_ops = {
+	.get_suggested_order = (void *)thp_fork_test,
+};
-- 
2.47.3




* [PATCH v6 mm-new 07/10] selftests/bpf: add test case to update thp policy
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (5 preceding siblings ...)
  2025-08-26  7:19 ` [PATCH v6 mm-new 06/10] selftests/bpf: add test case for khugepaged fork Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 08/10] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

-EBUSY is returned when attempting to attach a new BPF program while one
is already attached; updating the attached program, however, is permitted.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/thp_adjust.c     | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index bf367c6e6f52..6e65d7b0eb80 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -227,6 +227,27 @@ static void subtest_thp_fork(void)
 	bpf_link__destroy(ops_link);
 }
 
+static void subtest_thp_policy_update(void)
+{
+	struct bpf_link *old_link, *new_link;
+	int err;
+
+	old_link = bpf_map__attach_struct_ops(skel->maps.thp_fork_ops);
+	if (!ASSERT_OK_PTR(old_link, "attach_old_link"))
+		return;
+
+	new_link = bpf_map__attach_struct_ops(skel->maps.khugepaged_ops);
+	if (!ASSERT_NULL(new_link, "attach_new_link"))
+		goto destroy_old;
+	ASSERT_EQ(errno, EBUSY, "attach_new_link");
+
+	err = bpf_link__update_map(old_link, skel->maps.khugepaged_ops);
+	ASSERT_EQ(err, 0, "update_old_link");
+
+destroy_old:
+	bpf_link__destroy(old_link);
+}
+
 static int thp_adjust_setup(void)
 {
 	int err, cgrp_fd, cgrp_id, pmd_order;
@@ -308,6 +329,8 @@ void test_thp_adjust(void)
 		subtest_thp_policy();
 	if (test__start_subtest("khugepaged_fork"))
 		subtest_thp_fork();
+	if (test__start_subtest("policy_update"))
+		subtest_thp_policy_update();
 
 	thp_adjust_destroy();
 }
-- 
2.47.3




* [PATCH v6 mm-new 08/10] selftests/bpf: add test cases for invalid thp_adjust usage
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (6 preceding siblings ...)
  2025-08-26  7:19 ` [PATCH v6 mm-new 07/10] selftests/bpf: add test case to update thp policy Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 09/10] Documentation: add BPF-based THP adjustment documentation Yafang Shao
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

1. The trusted VMA pointer can be null and must be checked before
   dereferencing.
2. Resources acquired via bpf_mm_get_task() must be released with
   bpf_task_release().
3. Memory cgroups obtained through bpf_mm_get_mem_cgroup() must be released
   using bpf_put_mem_cgroup().

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/thp_adjust.c     |  7 +++++
 .../bpf/progs/test_thp_adjust_trusted_vma.c   | 27 +++++++++++++++++++
 .../progs/test_thp_adjust_unreleased_memcg.c  | 24 +++++++++++++++++
 .../progs/test_thp_adjust_unreleased_task.c   | 25 +++++++++++++++++
 4 files changed, 83 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_task.c

diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index 6e65d7b0eb80..846679acaff2 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -11,6 +11,9 @@
 #include <test_progs.h>
 #include "cgroup_helpers.h"
 #include "test_thp_adjust.skel.h"
+#include "test_thp_adjust_trusted_vma.skel.h"
+#include "test_thp_adjust_unreleased_task.skel.h"
+#include "test_thp_adjust_unreleased_memcg.skel.h"
 
 #define LEN (16 * 1024 * 1024) /* 16MB */
 #define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
@@ -333,4 +336,8 @@ void test_thp_adjust(void)
 		subtest_thp_policy_update();
 
 	thp_adjust_destroy();
+
+	RUN_TESTS(test_thp_adjust_trusted_vma);
+	RUN_TESTS(test_thp_adjust_unreleased_task);
+	RUN_TESTS(test_thp_adjust_unreleased_memcg);
 }
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
new file mode 100644
index 000000000000..caa73bebefcf
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/get_suggested_order")
+__failure __msg("R1 invalid mem access 'trusted_ptr_or_null_'")
+int BPF_PROG(thp_trusted_vma, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, u64 tva_flags, int orders)
+{
+	struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(vma__nullable->vm_mm);
+
+	if (!memcg)
+		return 0;
+
+	bpf_put_mem_cgroup(memcg);
+	return 1;
+}
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp_memcg_ops = {
+	.get_suggested_order = (void *)thp_trusted_vma,
+};
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_memcg.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_memcg.c
new file mode 100644
index 000000000000..467befebb35f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_memcg.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/get_suggested_order")
+__failure __msg("Unreleased reference")
+int BPF_PROG(thp_unreleased_memcg, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, u64 tva_flags, int orders)
+{
+	struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);
+
+	/* The memcg should be released with bpf_put_mem_cgroup() */
+	return memcg ? 0 : 1;
+}
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp_memcg_ops = {
+	.get_suggested_order = (void *)thp_unreleased_memcg,
+};
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_task.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_task.c
new file mode 100644
index 000000000000..50d756810412
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_task.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/get_suggested_order")
+__failure __msg("Unreleased reference")
+int BPF_PROG(thp_unreleased_task, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, u64 tva_flags, int orders)
+{
+	struct task_struct *p = bpf_mm_get_task(mm);
+
+	/* The task should be released with bpf_task_release() */
+	return p ? 0 : 1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp_task_ops = {
+	.get_suggested_order = (void *)thp_unreleased_task,
+};
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v6 mm-new 09/10] Documentation: add BPF-based THP adjustment documentation
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (7 preceding siblings ...)
  2025-08-26  7:19 ` [PATCH v6 mm-new 08/10] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-26  7:19 ` [PATCH v6 mm-new 10/10] MAINTAINERS: add entry for BPF-based THP adjustment Yafang Shao
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Document the BPF-based THP adjustment feature in the admin guide: an
overview, the get_suggested_order() program interface with its parameters
and return value, and notes on its experimental status.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 47 ++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index a16a04841b96..1725b89426a9 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -734,3 +734,50 @@ support enabled just fine as always. No difference can be noted in
 hugetlbfs other than there will be less overall fragmentation. All
 usual features belonging to hugetlbfs are preserved and
 unaffected. libhugetlbfs will also work fine as usual.
+
+BPF-based THP adjustment
+========================
+
+Overview
+--------
+
+When the system is configured with "always" or "madvise" THP mode, a BPF program
+can be used to adjust THP allocation policies dynamically. This enables
+fine-grained control over THP decisions based on various factors including
+workload identity, allocation context, and system memory pressure.
+
+Program Interface
+-----------------
+
+This feature implements a struct_ops BPF program with the following interface::
+
+  int (*get_suggested_order)(struct mm_struct *mm,
+                             struct vm_area_struct *vma__nullable,
+                             u64 vma_flags, enum tva_type tva_flags, int orders)
+
+Parameters::
+
+  @mm:  mm_struct associated with the THP allocation
+  @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
+                  When NULL, the decision should be based on @mm (i.e., when
+                  triggered from an mm-scope hook rather than a VMA-specific
+                  context)
+                  Must belong to @mm (guaranteed by the caller).
+  @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
+  @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
+  @orders: Bitmask of requested THP orders for this allocation
+           - PMD-mapped allocation if PMD_ORDER is set
+           - mTHP allocation otherwise
+
+Return value::
+
+  Bitmask of suggested THP orders for allocation. The highest suggested order
+  will not exceed the highest requested order in @orders.
+
+Implementation Notes
+--------------------
+
+This is currently an experimental feature.
+CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION must be enabled to use it.
+Only one BPF program can be attached at a time, but the program can be updated
+dynamically to adjust policies without requiring affected tasks to be restarted.
-- 
2.47.3
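
As an editorial illustration of the documented interface (not part of the
patch itself; the skeleton and map names follow the selftests in this
series, and new_thp_ops is a hypothetical second policy):

/* BPF side: a no-op policy that leaves the requested orders unchanged. */
SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggested_order, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
	     u64 vma_flags, u64 tva_flags, int orders)
{
	/* The highest suggested order must not exceed the highest requested order. */
	return orders;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_ops = {
	.get_suggested_order = (void *)suggested_order,
};

/* Userspace: only one policy can be attached at a time (a second attach
 * fails with EBUSY), but an attached policy can be replaced atomically
 * without restarting affected tasks.
 */
ops_link = bpf_map__attach_struct_ops(skel->maps.thp_ops);
err = bpf_link__update_map(ops_link, skel->maps.new_thp_ops);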



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v6 mm-new 10/10] MAINTAINERS: add entry for BPF-based THP adjustment
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (8 preceding siblings ...)
  2025-08-26  7:19 ` [PATCH v6 mm-new 09/10] Documentation: add BPF-based THP adjustment documentation Yafang Shao
@ 2025-08-26  7:19 ` Yafang Shao
  2025-08-27 15:47   ` Lorenzo Stoakes
  2025-08-26  7:42 ` [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection David Hildenbrand
  2025-08-27 13:14 ` Lorenzo Stoakes
  11 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-26  7:19 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Add a maintainership entry for the experimental BPF-driven THP adjustment
feature. This component may be removed in future releases; I will handle
maintenance tasks for it during its development lifecycle.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 MAINTAINERS | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 390829ae9803..71d0f7c58ce8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16239,6 +16239,7 @@ F:	Documentation/admin-guide/mm/transhuge.rst
 F:	include/linux/huge_mm.h
 F:	include/linux/khugepaged.h
 F:	include/trace/events/huge_memory.h
+F:	mm/bpf_thp.c
 F:	mm/huge_memory.c
 F:	mm/khugepaged.c
 F:	mm/mm_slot.h
@@ -16246,6 +16247,15 @@ F:	tools/testing/selftests/mm/khugepaged.c
 F:	tools/testing/selftests/mm/split_huge_page_test.c
 F:	tools/testing/selftests/mm/transhuge-stress.c
 
+MEMORY MANAGEMENT - THP WITH BPF SUPPORT
+M:	Yafang Shao <laoar.shao@gmail.com>
+L:	bpf@vger.kernel.org
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	mm/bpf_thp.c
+F:	tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+F:	tools/testing/selftests/bpf/progs/test_thp_adjust*
+
 MEMORY MANAGEMENT - USERFAULTFD
 M:	Andrew Morton <akpm@linux-foundation.org>
 R:	Peter Xu <peterx@redhat.com>
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (9 preceding siblings ...)
  2025-08-26  7:19 ` [PATCH v6 mm-new 10/10] MAINTAINERS: add entry for BPF-based THP adjustment Yafang Shao
@ 2025-08-26  7:42 ` David Hildenbrand
  2025-08-26  8:33   ` Lorenzo Stoakes
                     ` (2 more replies)
  2025-08-27 13:14 ` Lorenzo Stoakes
  11 siblings, 3 replies; 61+ messages in thread
From: David Hildenbrand @ 2025-08-26  7:42 UTC (permalink / raw)
  To: Yafang Shao, akpm, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, willy, ast, daniel, andrii,
	ameryhung, rientjes, corbet
  Cc: bpf, linux-mm, linux-doc

On 26.08.25 09:19, Yafang Shao wrote:
> Background
> ==========
> 
> Our production servers consistently configure THP to "never" due to
> historical incidents caused by its behavior. Key issues include:
> - Increased Memory Consumption
>    THP significantly raises overall memory usage, reducing available memory
>    for workloads.
> 
> - Latency Spikes
>    Random latency spikes occur due to frequent memory compaction triggered
>    by THP.
> 
> - Lack of Fine-Grained Control
>    THP tuning is globally configured, making it unsuitable for containerized
>    environments. When multiple workloads share a host, enabling THP without
>    per-workload control leads to unpredictable behavior.
> 
> Due to these issues, administrators avoid switching to madvise or always
> modes—unless per-workload THP control is implemented.
> 
> To address this, we propose BPF-based THP policy for flexible adjustment.
> Additionally, as David mentioned [0], this mechanism can also serve as a
> policy prototyping tool (test policies via BPF before upstreaming them).

There is a lot going on and most reviewers (including me) are fairly 
busy right now, so getting more detailed review could take a while.

This topic sounds like a good candidate for the bi-weekly MM alignment 
session.

Would you be interested in presenting the current bpf interface, how to 
use it,  drawbacks, todos, ... in that forum?

David Rientjes, who organizes this meeting, is already on Cc.

-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
  2025-08-26  7:42 ` [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection David Hildenbrand
@ 2025-08-26  8:33   ` Lorenzo Stoakes
  2025-08-26 12:06     ` Yafang Shao
  2025-08-26  9:52   ` Usama Arif
  2025-08-26 12:03   ` Yafang Shao
  2 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-26  8:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yafang Shao, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc

On Tue, Aug 26, 2025 at 09:42:30AM +0200, David Hildenbrand wrote:
> On 26.08.25 09:19, Yafang Shao wrote:
> > Background
> > ==========
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> >    THP significantly raises overall memory usage, reducing available memory
> >    for workloads.
> >
> > - Latency Spikes
> >    Random latency spikes occur due to frequent memory compaction triggered
> >    by THP.
> >
> > - Lack of Fine-Grained Control
> >    THP tuning is globally configured, making it unsuitable for containerized
> >    environments. When multiple workloads share a host, enabling THP without
> >    per-workload control leads to unpredictable behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or always
> > modes—unless per-workload THP control is implemented.
> >
> > To address this, we propose BPF-based THP policy for flexible adjustment.
> > Additionally, as David mentioned [0], this mechanism can also serve as a
> > policy prototyping tool (test policies via BPF before upstreaming them).
>
> There is a lot going on and most reviewers (including me) are fairly busy
> right now, so getting more detailed review could take a while.
>
> This topic sounds like a good candidate for the bi-weekly MM alignment
> session.
>
> Would you be interested in presenting the current bpf interface, how to use
> it,  drawbacks, todos, ... in that forum?
>
> David Rientjes, who organizes this meeting, is already on Cc.

If we do this, would like an invite to it also!

Have been meaning to take a look into this in detail while in RFC but more so
now obviously :) as discussed in THP cabal, I am broadly in favour of this as
long we get the interface right.

Anyway let me have a look through...!

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
  2025-08-26  7:42 ` [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection David Hildenbrand
  2025-08-26  8:33   ` Lorenzo Stoakes
@ 2025-08-26  9:52   ` Usama Arif
  2025-08-26 12:10     ` Yafang Shao
  2025-08-26 12:03   ` Yafang Shao
  2 siblings, 1 reply; 61+ messages in thread
From: Usama Arif @ 2025-08-26  9:52 UTC (permalink / raw)
  To: David Hildenbrand, Yafang Shao, akpm, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hannes, gutierrez.asier, willy, ast, daniel, andrii, ameryhung,
	rientjes, corbet
  Cc: bpf, linux-mm, linux-doc



On 26/08/2025 08:42, David Hildenbrand wrote:
> On 26.08.25 09:19, Yafang Shao wrote:
>> Background
>> ==========
>>
>> Our production servers consistently configure THP to "never" due to
>> historical incidents caused by its behavior. Key issues include:
>> - Increased Memory Consumption
>>    THP significantly raises overall memory usage, reducing available memory
>>    for workloads.
>>
>> - Latency Spikes
>>    Random latency spikes occur due to frequent memory compaction triggered
>>    by THP.
>>
>> - Lack of Fine-Grained Control
>>    THP tuning is globally configured, making it unsuitable for containerized
>>    environments. When multiple workloads share a host, enabling THP without
>>    per-workload control leads to unpredictable behavior.
>>
>> Due to these issues, administrators avoid switching to madvise or always
>> modes—unless per-workload THP control is implemented.
>>
>> To address this, we propose BPF-based THP policy for flexible adjustment.
>> Additionally, as David mentioned [0], this mechanism can also serve as a
>> policy prototyping tool (test policies via BPF before upstreaming them).
> 
> There is a lot going on and most reviewers (including me) are fairly busy right now, so getting more detailed review could take a while.
> 
> This topic sounds like a good candidate for the bi-weekly MM alignment session.
> 
> Would you be interested in presenting the current bpf interface, how to use it,  drawbacks, todos, ... in that forum?
> 

Could I get an invite please? Thanks!

Usama

> David Rientjes, who organizes this meeting, is already on Cc.
> 



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
  2025-08-26  7:42 ` [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection David Hildenbrand
  2025-08-26  8:33   ` Lorenzo Stoakes
  2025-08-26  9:52   ` Usama Arif
@ 2025-08-26 12:03   ` Yafang Shao
  2 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26 12:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc, yu c chen, Michal Koutný, libo.chen

On Tue, Aug 26, 2025 at 3:42 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.08.25 09:19, Yafang Shao wrote:
> > Background
> > ==========
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> >    THP significantly raises overall memory usage, reducing available memory
> >    for workloads.
> >
> > - Latency Spikes
> >    Random latency spikes occur due to frequent memory compaction triggered
> >    by THP.
> >
> > - Lack of Fine-Grained Control
> >    THP tuning is globally configured, making it unsuitable for containerized
> >    environments. When multiple workloads share a host, enabling THP without
> >    per-workload control leads to unpredictable behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or always
> > modes—unless per-workload THP control is implemented.
> >
> > To address this, we propose BPF-based THP policy for flexible adjustment.
> > Additionally, as David mentioned [0], this mechanism can also serve as a
> > policy prototyping tool (test policies via BPF before upstreaming them).
>
> There is a lot going on and most reviewers (including me) are fairly
> busy right now, so getting more detailed review could take a while.
>
> This topic sounds like a good candidate for the bi-weekly MM alignment
> session.
>
> Would you be interested in presenting the current bpf interface, how to
> use it,  drawbacks, todos, ... in that forum?

Sure.

>
> David Rientjes, who organizes this meeting, is already on Cc.

DavidR had previously reached out to me about this patchset.

Hello DavidR,

Would September 17 from 9:00–10:00 AM PDT (UTC-7) work for discussing
this topic? If that time isn’t convenient, I’m happy to schedule a
later session—this will also give me some time to prepare brief slides.

On a related note, I’d like to take this opportunity to also share a
short proposal on BPF-based NUMA balancing.

On our AMD EPYC servers, many services experience significant
performance degradation due to cross-NUMA access. While NUMA balancing
can help mitigate this, its current global enable/disable
implementation often leads to overall system performance regression.
We are exploring the use of BPF to selectively enable NUMA balancing
only for NUMA-sensitive services, thereby minimizing unintended side
effects. A similar approach has been proposed in [0] using prctl() or
a cgroup interface. We believe this use case is particularly
well-suited for a BPF-based solution, and I’ll briefly outline why in
the slide. I’ve included the developers from [0] in CC for visibility,
in case they are interested in joining the discussion.

Looking forward to your thoughts.

[0]. https://lore.kernel.org/lkml/20250625102337.3128193-1-yu.c.chen@intel.com/

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
  2025-08-26  8:33   ` Lorenzo Stoakes
@ 2025-08-26 12:06     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26 12:06 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, akpm, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc

On Tue, Aug 26, 2025 at 4:33 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 26, 2025 at 09:42:30AM +0200, David Hildenbrand wrote:
> > On 26.08.25 09:19, Yafang Shao wrote:
> > > Background
> > > ==========
> > >
> > > Our production servers consistently configure THP to "never" due to
> > > historical incidents caused by its behavior. Key issues include:
> > > - Increased Memory Consumption
> > >    THP significantly raises overall memory usage, reducing available memory
> > >    for workloads.
> > >
> > > - Latency Spikes
> > >    Random latency spikes occur due to frequent memory compaction triggered
> > >    by THP.
> > >
> > > - Lack of Fine-Grained Control
> > >    THP tuning is globally configured, making it unsuitable for containerized
> > >    environments. When multiple workloads share a host, enabling THP without
> > >    per-workload control leads to unpredictable behavior.
> > >
> > > Due to these issues, administrators avoid switching to madvise or always
> > > modes—unless per-workload THP control is implemented.
> > >
> > > To address this, we propose BPF-based THP policy for flexible adjustment.
> > > Additionally, as David mentioned [0], this mechanism can also serve as a
> > > policy prototyping tool (test policies via BPF before upstreaming them).
> >
> > There is a lot going on and most reviewers (including me) are fairly busy
> > right now, so getting more detailed review could take a while.
> >
> > This topic sounds like a good candidate for the bi-weekly MM alignment
> > session.
> >
> > Would you be interested in presenting the current bpf interface, how to use
> > it,  drawbacks, todos, ... in that forum?
> >
> > David Rientjes, who organizes this meeting, is already on Cc.
>
> If we do this, would like an invite to it also!
>
> Have been meaning to take a look into this in detail while in RFC but more so
> now obviously :) as discussed in THP cabal, I am broadly in favour of this as
> long we get the interface right.
>
> Anyway let me have a look through...!

Thanks in advance.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
  2025-08-26  9:52   ` Usama Arif
@ 2025-08-26 12:10     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-26 12:10 UTC (permalink / raw)
  To: Usama Arif
  Cc: David Hildenbrand, akpm, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, bpf, linux-mm, linux-doc

On Tue, Aug 26, 2025 at 5:52 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 26/08/2025 08:42, David Hildenbrand wrote:
> > On 26.08.25 09:19, Yafang Shao wrote:
> >> Background
> >> ==========
> >>
> >> Our production servers consistently configure THP to "never" due to
> >> historical incidents caused by its behavior. Key issues include:
> >> - Increased Memory Consumption
> >>    THP significantly raises overall memory usage, reducing available memory
> >>    for workloads.
> >>
> >> - Latency Spikes
> >>    Random latency spikes occur due to frequent memory compaction triggered
> >>    by THP.
> >>
> >> - Lack of Fine-Grained Control
> >>    THP tuning is globally configured, making it unsuitable for containerized
> >>    environments. When multiple workloads share a host, enabling THP without
> >>    per-workload control leads to unpredictable behavior.
> >>
> >> Due to these issues, administrators avoid switching to madvise or always
> >> modes—unless per-workload THP control is implemented.
> >>
> >> To address this, we propose BPF-based THP policy for flexible adjustment.
> >> Additionally, as David mentioned [0], this mechanism can also serve as a
> >> policy prototyping tool (test policies via BPF before upstreaming them).
> >
> > There is a lot going on and most reviewers (including me) are fairly busy right now, so getting more detailed review could take a while.
> >
> > This topic sounds like a good candidate for the bi-weekly MM alignment session.
> >
> > Would you be interested in presenting the current bpf interface, how to use it,  drawbacks, todos, ... in that forum?
> >
>
> Could I get an invite please? Thanks!

IIUC, a Google Meet link will be shared publicly on the upstream
mailing list, allowing any interested developers to join. If the
session is invite-only instead, I will make sure you receive an
invitation, given the significant help you've already provided for
this series.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-26  7:19 ` [PATCH v6 mm-new 01/10] mm: thp: add support for " Yafang Shao
@ 2025-08-27  2:57   ` kernel test robot
  2025-08-27 11:39     ` Yafang Shao
  2025-08-27 15:03   ` Lorenzo Stoakes
  2025-08-29  4:56   ` Barry Song
  2 siblings, 1 reply; 61+ messages in thread
From: kernel test robot @ 2025-08-27  2:57 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, willy, ast, daniel, andrii,
	ameryhung, rientjes, corbet
  Cc: oe-kbuild-all, bpf, linux-mm, linux-doc, Yafang Shao

Hi Yafang,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-thp-add-support-for-BPF-based-THP-order-selection/20250826-152415
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250826071948.2618-2-laoar.shao%40gmail.com
patch subject: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
config: loongarch-randconfig-r113-20250827 (https://download.01.org/0day-ci/archive/20250827/202508271009.5neOZ0OG-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce: (https://download.01.org/0day-ci/archive/20250827/202508271009.5neOZ0OG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508271009.5neOZ0OG-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/bpf_thp.c:47:31: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:47:31: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:47:31: sparse:    int ( * )( ... )
   mm/bpf_thp.c:101:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:101:9: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:101:9: sparse:    int ( * )( ... )
   mm/bpf_thp.c:102:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:102:9: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:102:9: sparse:    int ( * )( ... )
   mm/bpf_thp.c:111:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:111:9: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:111:9: sparse:    int ( * )( ... )
   mm/bpf_thp.c:112:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:112:9: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:112:9: sparse:    int ( * )( ... )
   mm/bpf_thp.c:112:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:112:9: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:112:9: sparse:    int ( * )( ... )
   mm/bpf_thp.c:133:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:133:9: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:133:9: sparse:    int ( * )( ... )
   mm/bpf_thp.c:134:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:134:9: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:134:9: sparse:    int ( * )( ... )
   mm/bpf_thp.c:134:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/bpf_thp.c:134:9: sparse:    int ( [noderef] __rcu * )( ... )
   mm/bpf_thp.c:134:9: sparse:    int ( * )( ... )
>> mm/bpf_thp.c:102:9: sparse: sparse: dereference of noderef expression
>> mm/bpf_thp.c:102:9: sparse: sparse: dereference of noderef expression
   mm/bpf_thp.c:112:9: sparse: sparse: dereference of noderef expression
   mm/bpf_thp.c:134:9: sparse: sparse: dereference of noderef expression
   mm/bpf_thp.c:134:9: sparse: sparse: dereference of noderef expression
   mm/bpf_thp.c:134:9: sparse: sparse: dereference of noderef expression
   mm/bpf_thp.c:148:14: sparse: sparse: dereference of noderef expression

vim +47 mm/bpf_thp.c

    33	
    34	int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
    35				u64 vma_flags, enum tva_type tva_flags, int orders)
    36	{
    37		int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
    38					   u64 vma_flags, enum tva_type tva_flags, int orders);
    39		int suggested_orders = orders;
    40	
    41		/* No BPF program is attached */
    42		if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
    43			      &transparent_hugepage_flags))
    44			return suggested_orders;
    45	
    46		rcu_read_lock();
  > 47		bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
    48		if (!bpf_suggested_order)
    49			goto out;
    50	
    51		suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
    52		if (highest_order(suggested_orders) > highest_order(orders))
    53			suggested_orders = orders;
    54	
    55	out:
    56		rcu_read_unlock();
    57		return suggested_orders;
    58	}
    59	
    60	static bool bpf_thp_ops_is_valid_access(int off, int size,
    61						enum bpf_access_type type,
    62						const struct bpf_prog *prog,
    63						struct bpf_insn_access_aux *info)
    64	{
    65		return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
    66	}
    67	
    68	static const struct bpf_func_proto *
    69	bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
    70	{
    71		return bpf_base_func_proto(func_id, prog);
    72	}
    73	
    74	static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
    75		.get_func_proto = bpf_thp_get_func_proto,
    76		.is_valid_access = bpf_thp_ops_is_valid_access,
    77	};
    78	
    79	static int bpf_thp_init(struct btf *btf)
    80	{
    81		return 0;
    82	}
    83	
    84	static int bpf_thp_init_member(const struct btf_type *t,
    85				       const struct btf_member *member,
    86				       void *kdata, const void *udata)
    87	{
    88		return 0;
    89	}
    90	
    91	static int bpf_thp_reg(void *kdata, struct bpf_link *link)
    92	{
    93		struct bpf_thp_ops *ops = kdata;
    94	
    95		spin_lock(&thp_ops_lock);
    96		if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
    97				     &transparent_hugepage_flags)) {
    98			spin_unlock(&thp_ops_lock);
    99			return -EBUSY;
   100		}
   101		WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
 > 102		rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order);
   103		spin_unlock(&thp_ops_lock);
   104		return 0;
   105	}
   106	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-27  2:57   ` kernel test robot
@ 2025-08-27 11:39     ` Yafang Shao
  2025-08-27 15:04       ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-27 11:39 UTC (permalink / raw)
  To: kernel test robot
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, oe-kbuild-all, bpf, linux-mm, linux-doc

On Wed, Aug 27, 2025 at 10:58 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Yafang,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-thp-add-support-for-BPF-based-THP-order-selection/20250826-152415
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20250826071948.2618-2-laoar.shao%40gmail.com
> patch subject: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
> config: loongarch-randconfig-r113-20250827 (https://download.01.org/0day-ci/archive/20250827/202508271009.5neOZ0OG-lkp@intel.com/config)
> compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
> reproduce: (https://download.01.org/0day-ci/archive/20250827/202508271009.5neOZ0OG-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202508271009.5neOZ0OG-lkp@intel.com/

Thanks for the report.
The warnings seem to come from annotating the function-pointer member
itself with __rcu, which sparse cannot reconcile with the casts inside the
RCU accessors. It seems they can be fixed by moving the annotation onto a
typedef'd function-pointer type, as in the additional change below - would
you please test it again?

diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
index 46b3bc96359e..b2f97f9e930d 100644
--- a/mm/bpf_thp.c
+++ b/mm/bpf_thp.c
@@ -5,27 +5,32 @@
 #include <linux/huge_mm.h>
 #include <linux/khugepaged.h>

+/**
+ * @get_suggested_order: Get the suggested THP orders for allocation
+ * @mm: mm_struct associated with the THP allocation
+ * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
+ *                 When NULL, the decision should be based on @mm (i.e., when
+ *                 triggered from an mm-scope hook rather than a VMA-specific
+ *                 context).
+ *                 Must belong to @mm (guaranteed by the caller).
+ * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
+ * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
+ * @orders: Bitmask of requested THP orders for this allocation
+ *          - PMD-mapped allocation if PMD_ORDER is set
+ *          - mTHP allocation otherwise
+ *
+ * Return: Bitmask of suggested THP orders for allocation. The highest
+ *         suggested order will not exceed the highest requested order
+ *         in @orders.
+ */
+typedef int suggested_order_fn_t(struct mm_struct *mm,
+                                struct vm_area_struct *vma__nullable,
+                                u64 vma_flags,
+                                enum tva_type tva_flags,
+                                int orders);
+
 struct bpf_thp_ops {
-       /**
-        * @get_suggested_order: Get the suggested THP orders for allocation
-        * @mm: mm_struct associated with the THP allocation
-        * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
-        *                 When NULL, the decision should be based on @mm (i.e., when
-        *                 triggered from an mm-scope hook rather than a VMA-specific
-        *                 context).
-        *                 Must belong to @mm (guaranteed by the caller).
-        * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
-        * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
-        * @orders: Bitmask of requested THP orders for this allocation
-        *          - PMD-mapped allocation if PMD_ORDER is set
-        *          - mTHP allocation otherwise
-        *
-        * Rerurn: Bitmask of suggested THP orders for allocation. The highest
-        *         suggested order will not exceed the highest requested order
-        *         in @orders.
-        */
-       int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
-                                  u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
+       suggested_order_fn_t __rcu *get_suggested_order;
 };

 static struct bpf_thp_ops bpf_thp;
@@ -34,8 +39,7 @@ static DEFINE_SPINLOCK(thp_ops_lock);
 int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
                        u64 vma_flags, enum tva_type tva_flags, int orders)
 {
-       int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
-                                  u64 vma_flags, enum tva_type tva_flags, int orders);
+       suggested_order_fn_t *bpf_suggested_order;
        int suggested_orders = orders;

        /* No BPF program is attached */
@@ -106,10 +110,12 @@ static int bpf_thp_reg(void *kdata, struct bpf_link *link)

 static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
 {
+       suggested_order_fn_t *old_fn;
+
        spin_lock(&thp_ops_lock);
        clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
-       WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
-       rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
+       old_fn = rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
+       WARN_ON_ONCE(!old_fn);
        spin_unlock(&thp_ops_lock);

        synchronize_rcu();
@@ -117,8 +123,9 @@ static void bpf_thp_unreg(void *kdata, struct bpf_link *link)

 static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
 {
-       struct bpf_thp_ops *ops = kdata;
+       suggested_order_fn_t *old_fn, *new_fn;
        struct bpf_thp_ops *old = old_kdata;
+       struct bpf_thp_ops *ops = kdata;
        int ret = 0;

        if (!ops || !old)
@@ -130,9 +137,10 @@ static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
                ret = -ENOENT;
                goto out;
        }
-       WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
-       rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
-                           lockdep_is_held(&thp_ops_lock));
+
+       new_fn = rcu_dereference(ops->get_suggested_order);
+       old_fn = rcu_replace_pointer(bpf_thp.get_suggested_order, new_fn, lockdep_is_held(&thp_ops_lock));
+       WARN_ON_ONCE(!old_fn || !new_fn);

 out:
        spin_unlock(&thp_ops_lock);
@@ -159,7 +167,7 @@ static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nul
 }

 static struct bpf_thp_ops __bpf_thp_ops = {
-       .get_suggested_order = suggested_order,
+       .get_suggested_order = (suggested_order_fn_t __rcu *)suggested_order,
 };

 static struct bpf_struct_ops bpf_bpf_thp_ops = {


--
Regards

Yafang


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
  2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (10 preceding siblings ...)
  2025-08-26  7:42 ` [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection David Hildenbrand
@ 2025-08-27 13:14 ` Lorenzo Stoakes
  2025-08-28  2:58   ` Yafang Shao
  11 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-27 13:14 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Tue, Aug 26, 2025 at 03:19:38PM +0800, Yafang Shao wrote:
> Background
> ==========
>
> Our production servers consistently configure THP to "never" due to
> historical incidents caused by its behavior. Key issues include:
> - Increased Memory Consumption
>   THP significantly raises overall memory usage, reducing available memory
>   for workloads.
>
> - Latency Spikes
>   Random latency spikes occur due to frequent memory compaction triggered
>   by THP.
>
> - Lack of Fine-Grained Control
>   THP tuning is globally configured, making it unsuitable for containerized
>   environments. When multiple workloads share a host, enabling THP without
>   per-workload control leads to unpredictable behavior.
>
> Due to these issues, administrators avoid switching to madvise or always
> modes—unless per-workload THP control is implemented.
>
> To address this, we propose BPF-based THP policy for flexible adjustment.
> Additionally, as David mentioned [0], this mechanism can also serve as a
> policy prototyping tool (test policies via BPF before upstreaming them).

I think it's important to highlight here that we are exploring an _experimental_
implementation.

>
> Proposed Solution
> =================
>
> As suggested by David [0], we introduce a new BPF interface:

I do agree, to be clear, with this broad approach - that is, to provide the
minimum information that a reasonable decision can be made upon and to keep
things as simple as we can.

As per the THP cabal (I think? :) the general consensus was in line with
this.


>
> /**
>  * @get_suggested_order: Get the suggested THP orders for allocation
>  * @mm: mm_struct associated with the THP allocation
>  * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>  *                 When NULL, the decision should be based on @mm (i.e., when
>  *                 triggered from an mm-scope hook rather than a VMA-specific
>  *                 context).

I'm a little wary of handing a VMA to BPF, under what locking would it be
provided?

>  *                 Must belong to @mm (guaranteed by the caller).
>  * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)

Hmm this one is also a bit odd - why would these flags differ? Note that I will
be changing the VMA flags to a bitmap relatively soon which may be larger than
the system word size.

So 'handing around all the flags' is something we probably want to avoid.

For the f_op->mmap_prepare stuff I provided an abstraction for this.

>  * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>  * @orders: Bitmask of requested THP orders for this allocation
>  *          - PMD-mapped allocation if PMD_ORDER is set
>  *          - mTHP allocation otherwise
>  *
>  * Rerurn: Bitmask of suggested THP orders for allocation. The highest

Obv. a cover letter thing but typo here :P rerurn -> return.

>  *         suggested order will not exceed the highest requested order
>  *         in @orders.

In what sense are they 'suggested'? Is this a product of sysfs settings or? I
think this needs to be clearer.

>  */
>  int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>                             u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;

Also here in what sense is this suggested? :)

>
> This interface:
> - Supports both use cases (per-workload tuning + policy prototyping).
> - Can be extended with BPF helpers (e.g., for memory pressure awareness).

Hm how would extensions like this work?

>
> This is an experimental feature. To use it, you must enable
> CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.

Yes! Thanks. I am glad we are putting this behind a config flag.

>
> Warning:
> - The interface may change
> - Behavior may differ in future kernel versions
> - We might remove it in the future
>
>
> Selftests
> =========
>
> BPF selftests
> -------------
>
> Patch #5: Implements a basic BPF THP policy that restricts THP allocation
>           via khugepaged to tasks within a specified memory cgroup.
> Patch #6: Contains test cases validating the khugepaged fork behavior.
> Patch #7: Provides tests for dynamic BPF program updates and replacement.
> Patch #8: Includes negative tests for invalid BPF helper usage, verifying
>           proper verification by the BPF verifier.
>
> Currently, several dependency patches reside in mm-new but haven't been
> merged into bpf-next:
>   mm: add bitmap mm->flags field
>   mm/huge_memory: convert "tva_flags" to "enum tva_type"
>   mm: convert core mm to mm_flags_*() accessors
>
> To enable BPF CI testing, these dependencies were manually applied to
> bpf-next [1]. All selftests in this series pass successfully. The observed
> CI failures are unrelated to these changes.

Cool, glad at least my mm changes were ok :)

>
> Performance Evaluation
> ----------------------
>
> As suggested by Usama [2], performance impact was measured given the page
> fault handler modifications. The standard `perf bench mem memset` benchmark
> was employed to assess page fault performance.
>
> Testing was conducted on an AMD EPYC 7W83 64-Core Processor (single NUMA
> node). Due to variance between individual test runs, a script executed
> 10000 iterations to calculate meaningful averages and standard deviations.
>
> The results across three configurations show negligible performance impact:
> - Baseline (without this patch series)
> - With patch series but no BPF program attached
> - With patch series and BPF program attached
>
> The result are as follows,
>
>   Number of runs: 10,000
>   Average throughput: 40-41 GB/sec
>   Standard deviation: 7-8 GB/sec

You're not giving data comparing the 3? Could you do so? Thanks.

>
> Production verification
> -----------------------
>
> We have successfully deployed a variant of this approach across numerous
> Kubernetes production servers. The implementation enables THP for specific
> workloads (such as applications utilizing ZGC [3]) while disabling it for
> others. This selective deployment has operated flawlessly, with no
> regression reports to date.
>
> For ZGC-based applications, our verification demonstrates that shmem THP
> delivers significant improvements:
> - Reduced CPU utilization
> - Lower average latencies

Obviously it's _really key_ to point out that this feature is intended to
be _absolutely_ ephemeral - we may or may not implement something like this
- it's really about both exploring how such an interface might look and
also helping to determine how an 'automagic' future might look.

>
> Future work
> ===========
>
> Based on our validation with production workloads, we observed mixed
> results with XFS large folios (also known as File THP):
>
> - Performance Benefits
>   Some workloads demonstrated significant improvements with XFS large
>   folios enabled
> - Performance Regression
>   Some workloads experienced degradation when using XFS large folios
>
> These results demonstrate that File THP, similar to anonymous THP, requires
> a more granular approach instead of a uniform implementation.
>
> We will extend the BPF-based order selection mechanism to support File THP
> allocation policies.
>
> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> Link: https://github.com/kernel-patches/bpf/pull/9561 [1]
> Link: https://lwn.net/ml/all/a24d632d-4b11-4c88-9ed0-26fa12a0fce4@gmail.com/ [2]
> Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTransparentHugePagesOnLinux [3]
>
> Changes:
> =======
>
> RFC v5-> v6:
> - Code improvement around the RCU usage (Usama)
> - Add selftests for khugepaged fork (Usama)
> - Add performance data for page fault (Usama)
> - Remove the RFC tag
>

Sorry I haven't been involved in the RFC reviews - I always intended to,
but workload etc.

Will be looking through this series as very interested in exploring this
approach.

Cheers, Lorenzo

> RFC v4->v5: https://lwn.net/Articles/1034265/
> - Add support for vma (David)
> - Add mTHP support in khugepaged (Zi)
> - Use bitmask of all allowed orders instead (Zi)
> - Retrieve the page size and PMD order rather than hardcoding them (Zi)
>
> RFC v3->v4: https://lwn.net/Articles/1031829/
> - Use a new interface get_suggested_order() (David)
> - Mark it as experimental (David, Lorenzo)
> - Code improvement in THP (Usama)
> - Code improvement in BPF struct ops (Amery)
>
> RFC v2->v3: https://lwn.net/Articles/1024545/
> - Finer-graind tuning based on madvise or always mode (David, Lorenzo)
> - Use BPF to write more advanced policies logic (David, Lorenzo)
>
> RFC v1->v2: https://lwn.net/Articles/1021783/
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (10):
>   mm: thp: add support for BPF based THP order selection
>   mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
>   mm: thp: add a new kfunc bpf_mm_get_task()
>   bpf: mark vma->vm_mm as trusted
>   selftests/bpf: add a simple BPF based THP policy
>   selftests/bpf: add test case for khugepaged fork
>   selftests/bpf: add test case to update thp policy
>   selftests/bpf: add test cases for invalid thp_adjust usage
>   Documentation: add BPF-based THP adjustment documentation
>   MAINTAINERS: add entry for BPF-based THP adjustment
>
>  Documentation/admin-guide/mm/transhuge.rst    |  47 +++
>  MAINTAINERS                                   |  10 +
>  include/linux/huge_mm.h                       |  15 +
>  include/linux/khugepaged.h                    |  12 +-
>  kernel/bpf/verifier.c                         |   5 +
>  mm/Kconfig                                    |  12 +
>  mm/Makefile                                   |   1 +
>  mm/bpf_thp.c                                  | 269 ++++++++++++++
>  mm/huge_memory.c                              |  10 +
>  mm/khugepaged.c                               |  26 +-
>  mm/memory.c                                   |  18 +-
>  tools/testing/selftests/bpf/config            |   3 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 343 ++++++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c     | 115 ++++++
>  .../bpf/progs/test_thp_adjust_trusted_vma.c   |  27 ++
>  .../progs/test_thp_adjust_unreleased_memcg.c  |  24 ++
>  .../progs/test_thp_adjust_unreleased_task.c   |  25 ++
>  17 files changed, 955 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_memcg.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_unreleased_task.c
>
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-26  7:19 ` [PATCH v6 mm-new 01/10] mm: thp: add support for " Yafang Shao
  2025-08-27  2:57   ` kernel test robot
@ 2025-08-27 15:03   ` Lorenzo Stoakes
  2025-08-28  5:54     ` Yafang Shao
  2025-08-29  4:56   ` Barry Song
  2 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-27 15:03 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Tue, Aug 26, 2025 at 03:19:39PM +0800, Yafang Shao wrote:
> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> programs to influence THP order selection based on factors such as:
> - Workload identity
>   For example, workloads running in specific containers or cgroups.
> - Allocation context
>   Whether the allocation occurs during a page fault, khugepaged, or other
>   paths.
> - System memory pressure
>   (May require new BPF helpers to accurately assess memory pressure.)
>
> Key Details:
> - Only one BPF program can be attached at a time, but it can be updated
>   dynamically to adjust the policy.
> - Supports automatic mTHP order selection and per-workload THP policies.
> - Only functional when THP is set to madvise or always.
>
> It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> This feature is unstable and may evolve in future kernel versions.
>
> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/huge_mm.h    |  15 +++
>  include/linux/khugepaged.h |  12 ++-
>  mm/Kconfig                 |  12 +++
>  mm/Makefile                |   1 +
>  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++

Please add new files to MAINTAINERS as you add them.

>  mm/huge_memory.c           |  10 ++
>  mm/khugepaged.c            |  26 +++++-
>  mm/memory.c                |  18 +++-
>  8 files changed, 273 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1ac0d06fb3c1..f0c91d7bd267 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,8 @@
>
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  #include <linux/kobject.h>
> +#include <linux/pgtable.h>
> +#include <linux/mm.h>

Hm this is a bit weird as mm.h includes huge_mm... I guess it will be handled
by the include guards, but still.

>
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
>  	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>  	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
>  };
>
>  struct kobject;
> @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
>  			(1<<TRANSPARENT_HUGEPAGE_FLAG);
>  }
>
> +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +			u64 vma_flags, enum tva_type tva_flags, int orders);

Not a massive fan of this naming to be honest. I think it should explicitly
reference bpf, e.g. bpf_hook_thp_get_order() or something.

Right now this is super unclear as to what it's for.

Also wrt vma_flags - this type is wrong :) it's vm_flags_t and going to change
to a bitmap of unlimiiteeed size soon. So probs best not to pass around as value
type either.

But unclear us to purpose as mentioned elsewhere.

And also get_suggested_order() should be get_suggested_orderS() no? As you
seem later in the code to be referencing a bitfield?

Also will mm ever != vma->vm_mm?

Are we hacking this for the sake of overloading what this does?

Also if we're returning a bitmask of orders which you seem to be (not sure I
like that tbh - I feel like we should simply provide one order but open for
discussion) - shouldn't it return an unsigned long?

> +#else
> +static inline int
> +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +		    u64 vma_flags, enum tva_type tva_flags, int orders)
> +{
> +	return orders;
> +}
> +#endif
> +
>  static inline int highest_order(unsigned long orders)
>  {
>  	return fls_long(orders) - 1;
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index eb1946a70cff..d81c1228a21f 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -4,6 +4,8 @@
>
>  #include <linux/mm.h>
>
> +#include <linux/huge_mm.h>
> +

Hm this is iffy too. There's probably a reason we didn't include this before;
the headers can be so, so fragile. Let's be cautious...

>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern struct attribute_group khugepaged_attr_group;
> @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>
>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> -	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> +	/*
> +	 * THP allocation policy can be dynamically modified via BPF. Even if a
> +	 * task was allowed to allocate THPs, BPF can decide whether its forked
> +	 * child can allocate THPs.
> +	 *
> +	 * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> +	 */
> +	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> +		get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))

Hmmm so there seems to be some kind of additional functionality you're providing
here kinda quietly, which is to allow the exact same interface to determine
whether we kick off khugepaged or not.

Don't love that; I think we should be hugely specific about that.

This bpf interface should literally be 'ok we're deciding what order we
want'. It feels like a bit of a gross overloading?

>  		__khugepaged_enter(mm);
>  }
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 4108bcd96784..d10089e3f181 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
>
>  	  EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config EXPERIMENTAL_BPF_ORDER_SELECTION
> +	bool "BPF-based THP order selection (EXPERIMENTAL)"
> +	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> +
> +	help
> +	  Enable dynamic THP order selection using BPF programs. This
> +	  experimental feature allows custom BPF logic to determine optimal
> +	  transparent hugepage allocation sizes at runtime.
> +
> +	  Warning: This feature is unstable and may change in future kernel
> +	  versions.

Thanks! This is important to document. Absolute nitty nit: can you capitalise
'WARNING'? Thanks!

> +
>  endif # TRANSPARENT_HUGEPAGE
>
>  # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index ef54aa615d9d..cb55d1509be1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> new file mode 100644
> index 000000000000..fbff3b1bb988
> --- /dev/null
> +++ b/mm/bpf_thp.c

As mentioned before, please update MAINTAINERS for new files. I went to great +
painful lengths to get everything listed there so let's keep it that way please
:P

> @@ -0,0 +1,186 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +struct bpf_thp_ops {
> +	/**
> +	 * @get_suggested_order: Get the suggested THP orders for allocation
> +	 * @mm: mm_struct associated with the THP allocation
> +	 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> +	 *                 When NULL, the decision should be based on @mm (i.e., when
> +	 *                 triggered from an mm-scope hook rather than a VMA-specific
> +	 *                 context).
> +	 *                 Must belong to @mm (guaranteed by the caller).
> +	 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> +	 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> +	 * @orders: Bitmask of requested THP orders for this allocation
> +	 *          - PMD-mapped allocation if PMD_ORDER is set
> +	 *          - mTHP allocation otherwise
> +	 *
> +	 * Rerurn: Bitmask of suggested THP orders for allocation. The highest
> +	 *         suggested order will not exceed the highest requested order
> +	 *         in @orders.
> +	 */
> +	int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +				   u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;

I feel like we should be declaring this function pointer type somewhere else as
we're now duplicating this in two places.
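
E.g. declare it once (name illustrative) and use the typedef in both spots:

typedef int suggested_order_fn_t(struct mm_struct *mm,
				 struct vm_area_struct *vma__nullable,
				 u64 vma_flags, enum tva_type tva_type,
				 int orders);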

> +};
> +
> +static struct bpf_thp_ops bpf_thp;
> +static DEFINE_SPINLOCK(thp_ops_lock);
> +
> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +			u64 vma_flags, enum tva_type tva_flags, int orders)

surely tva_flag? As this is an enum value?

> +{
> +	int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +				   u64 vma_flags, enum tva_type tva_flags, int orders);

This type for vma flags is totally incorrect - it should be vm_flags_t. And
that's going to change soon to an opaque type.

Also, right now it's actually an unsigned long.

I really really do not like that we're providing extra, unexplained VMA flags
for some reason. I may be missing something :) so happy to hear why this is
necessary.

However in future we really shouldn't be passing something like this.

Also - now a third duplication of the same function pointer :) can we do better
than this? At least typedef it.

> +	int suggested_orders = orders;
> +
> +	/* No BPF program is attached */
> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +		      &transparent_hugepage_flags))
> +		return suggested_orders;

This is atomic ofc, but are we concerned about races? Or I guess you expect
only the first attached bpf program to work with it.

> +
> +	rcu_read_lock();

Is this sufficient? Anything stopping the mm or VMA going away here?

> +	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> +	if (!bpf_suggested_order)
> +		goto out;
> +
> +	suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);

OK so now it's suggested order_S but we're invoking suggested order :) whaaatt?
:)

> +	if (highest_order(suggested_orders) > highest_order(orders))
> +		suggested_orders = orders;

Hmmm so the semantics are - whichever is the highest order wins?

I thought the idea was that, in effect, we'd hand control over to bpf if a
program is provided?

Definitely worth going over these semantics in the cover letter (and do
forgive me if you have and I've missed it! :)
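
To spell out my reading of the snippet above as a self-contained sketch
(highest_order() as defined in huge_mm.h):

/* The BPF result is used verbatim unless its highest order exceeds the
 * highest requested order, in which case it is discarded entirely in
 * favour of the original @orders.
 */
static int effective_orders(int orders, int bpf_orders)
{
	if (highest_order(bpf_orders) > highest_order(orders))
		return orders;
	return bpf_orders;
}

If that's right, it's neither 'BPF wins' nor a strict subset mask.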

> +
> +out:
> +	rcu_read_unlock();
> +	return suggested_orders;
> +}
> +
> +static bool bpf_thp_ops_is_valid_access(int off, int size,
> +					enum bpf_access_type type,
> +					const struct bpf_prog *prog,
> +					struct bpf_insn_access_aux *info)
> +{
> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return bpf_base_func_proto(func_id, prog);
> +}
> +
> +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> +	.get_func_proto = bpf_thp_get_func_proto,
> +	.is_valid_access = bpf_thp_ops_is_valid_access,
> +};
> +
> +static int bpf_thp_init(struct btf *btf)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_init_member(const struct btf_type *t,
> +			       const struct btf_member *member,
> +			       void *kdata, const void *udata)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +
> +	spin_lock(&thp_ops_lock);
> +	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +			     &transparent_hugepage_flags)) {
> +		spin_unlock(&thp_ops_lock);
> +		return -EBUSY;
> +	}
> +	WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
> +	rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order);
> +	spin_unlock(&thp_ops_lock);
> +	return 0;
> +}
> +
> +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> +{
> +	spin_lock(&thp_ops_lock);
> +	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> +	WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> +	rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> +	spin_unlock(&thp_ops_lock);
> +
> +	synchronize_rcu();
> +}

I am a total beginner with BPF implementations so don't feel like I can say much
intelligent about the above. But presumably fairly standard fare BPF-wise?

Will perhaps try to dig deeper on another iteration :) as it's interesting to me.

> +
> +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +	struct bpf_thp_ops *old = old_kdata;
> +	int ret = 0;
> +
> +	if (!ops || !old)
> +		return -EINVAL;
> +
> +	spin_lock(&thp_ops_lock);
> +	/* The prog has aleady been removed. */
> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
> +		ret = -ENOENT;
> +		goto out;
> +	}

OK so we gate things on this flag and it's global, got it.

I see this is a hook, and I guess RCU-all-the-things is what BPF does which
makes tonnes of sense.

> +	WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> +	rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
> +			    lockdep_is_held(&thp_ops_lock));
> +
> +out:
> +	spin_unlock(&thp_ops_lock);
> +	if (!ret)
> +		synchronize_rcu();
> +	return ret;
> +}
> +
> +static int bpf_thp_validate(void *kdata)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +
> +	if (!ops->get_suggested_order) {
> +		pr_err("bpf_thp: required ops isn't implemented\n");
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +			   u64 vma_flags, enum tva_type vm_flags, int orders)
> +{
> +	return orders;
> +}
> +
> +static struct bpf_thp_ops __bpf_thp_ops = {
> +	.get_suggested_order = suggested_order,
> +};

Can you explain to me what this stub stuff is for? This is more 'BPF impl 101'
stuff sorry :)

> +
> +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> +	.verifier_ops = &thp_bpf_verifier_ops,
> +	.init = bpf_thp_init,
> +	.init_member = bpf_thp_init_member,
> +	.reg = bpf_thp_reg,
> +	.unreg = bpf_thp_unreg,
> +	.update = bpf_thp_update,
> +	.validate = bpf_thp_validate,
> +	.cfi_stubs = &__bpf_thp_ops,
> +	.owner = THIS_MODULE,
> +	.name = "bpf_thp_ops",
> +};
> +
> +static int __init bpf_thp_ops_init(void)
> +{
> +	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> +
> +	if (err)
> +		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> +	return err;
> +}
> +late_initcall(bpf_thp_ops_init);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d89992b65acc..bd8f8f34ab3c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  		return ret;
>  	khugepaged_enter_vma(vma, vma->vm_flags);
>
> +	/*
> +	 * This check must occur after khugepaged_enter_vma() because:
> +	 * 1. We may permit THP allocation via khugepaged
> +	 * 2. While simultaneously disallowing THP allocation
> +	 *    during page fault handling
> +	 */
> +	if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
> +				BIT(PMD_ORDER))

Hmmm so you return a bitmask of orders, but then you only allow this fault if
the only order provided is PMD order? That seems strange. Can you explain?

> +		return VM_FAULT_FALLBACK;

It'd be good to have a helper function for this like:

	if (!bpf_hook_allow_pmd_order(vma, tva_flag))
		return VM_FAULT_FALLBACK;

And implemented like maybe:

static bool bpf_hook_allow_pmd_order(struct vm_area_struct *vma, enum tva_type tva_flag)
{
	int orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, tva_flag,
			BIT(PMD_ORDER));

	return orders & BIT(PMD_ORDER);
}

It's good the tva flag gives context though.

> +
>  	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>  			!mm_forbids_zeropage(vma->vm_mm) &&
>  			transparent_hugepage_use_zero_page()) {
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d3d4f116e14b..935583626db6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) &&
> +		    get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED,
> +					BIT(PMD_ORDER)))

Why aren't we working the bpf hook into thp_vma_allowable_order()?

Also a helper would work here.

>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }
> @@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>  		return SCAN_ADDRESS_RANGE;
>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
>  		return SCAN_VMA_CHECK;
> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER)))
> +		return SCAN_VMA_CHECK;



>  	/*
>  	 * Anon VMA expected, the address may be unmapped then
>  	 * remapped to file after khugepaged reaquired the mmap_lock.
> @@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>  		/* khugepaged_mm_lock actually not necessary for the below */
>  		mm_slot_free(mm_slot_cache, mm_slot);
>  		mmdrop(mm);
> +	} else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) {
> +		hash_del(&slot->hash);
> +		list_del(&slot->mm_node);
> +		mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> +		mm_slot_free(mm_slot_cache, mm_slot);
>  	}
>  }
>
> @@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
>  		return SCAN_VMA_CHECK;
>
> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> +				 BIT(PMD_ORDER)))

Again, can we please not duplicate thp_vma_allowable_order() logic?

The THP code is horrible enough, but now we have to remember to also do the bpf
check?

> +		return SCAN_VMA_CHECK;
>  	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
>  	if (userfaultfd_wp(vma))
>  		return SCAN_PTE_UFFD_WP;
> @@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  	 * the next mm on the list.
>  	 */
>  	vma = NULL;
> +
> +	/* If this mm is not suitable for the scan list, we should remove it. */
> +	if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> +		goto breakouterloop_mmap_lock;

OK, again I'm really not loving this NULL, 0, -1 stuff. What is this supposed
to mean? The idea here is that we have a hook for 'trying to determine THP
order', and now it seems to be overloaded in multiple ways?

I may be missing context here.

I'm also a bit perplexed by the comment as to what is intended here.

>  	if (unlikely(!mmap_read_trylock(mm)))
>  		goto breakouterloop_mmap_lock;
>
> @@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  			progress++;
>  			break;
>  		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) ||
> +		    !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED,
> +					 BIT(PMD_ORDER))) {

Same various comments from above.

>  skip:
>  			progress++;
>  			continue;
> @@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
>  		return -EINVAL;
>
> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> +				 BIT(PMD_ORDER)))
> +		return -EINVAL;
> +

Same various comments from above.

>  	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
>  	if (!cc)
>  		return -ENOMEM;
> diff --git a/mm/memory.c b/mm/memory.c
> index d9de6c056179..0178857aa058 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
>  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> +	int order, suggested_orders;
>  	unsigned long orders;
>  	struct folio *folio;
>  	unsigned long addr;
> @@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  	spinlock_t *ptl;
>  	pte_t *pte;
>  	gfp_t gfp;
> -	int order;
>
>  	/*
>  	 * If uffd is active for the vma we need per-page fault fidelity to
> @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  	if (!zswap_never_enabled())
>  		goto fallback;
>
> +	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> +					       TVA_PAGEFAULT,
> +					       BIT(PMD_ORDER) - 1);
> +	if (!suggested_orders)
> +		goto fallback;

Wait - below we have a bunch of fallbacks, but now BPF overrides everything?

I know I'm repeating myself :P but can we please just put this into
thp_vma_allowable_orders()? It's massively gross to duplicate this check
_everywhere_ with subtle differences.
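
Something like this - hand-wavy sketch only, with __thp_vma_allowable_orders()
standing in for the current logic and bpf_hook_thp_get_orders() for the
renamed hook:

static inline unsigned long
thp_vma_allowable_orders(struct vm_area_struct *vma, vm_flags_t vm_flags,
			 enum tva_type type, unsigned long orders)
{
	/* Apply the BPF filter first so every caller gets it for free. */
	orders &= bpf_hook_thp_get_orders(vma->vm_mm, vma, vm_flags, type, orders);
	if (!orders)
		return 0;
	return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
}

Then none of these call sites need to remember the bpf check separately.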

>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	/*
>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>  	 * and suitable for swapping THP.
>  	 */
>  	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> -					  BIT(PMD_ORDER) - 1);
> +					  suggested_orders);
>  	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>  	orders = thp_swap_suitable_orders(swp_offset(entry),
>  					  vmf->address, orders);
> @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	int order, suggested_orders;
>  	unsigned long orders;
>  	struct folio *folio;
>  	unsigned long addr;
>  	pte_t *pte;
>  	gfp_t gfp;
> -	int order;
>
>  	/*
>  	 * If uffd is active for the vma we need per-page fault fidelity to
> @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  	if (unlikely(userfaultfd_armed(vma)))
>  		goto fallback;
>
> +	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> +					       TVA_PAGEFAULT,
> +					       BIT(PMD_ORDER) - 1);
> +	if (!suggested_orders)
> +		goto fallback;

Same comment as above.

>  	/*
>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>  	 * for this vma. Then filter out the orders that can't be allocated over
>  	 * the faulting address and still be fully contained in the vma.
>  	 */
>  	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> -					  BIT(PMD_ORDER) - 1);
> +					  suggested_orders);
>  	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>
>  	if (!orders)
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-27 11:39     ` Yafang Shao
@ 2025-08-27 15:04       ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-27 15:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: kernel test robot, akpm, david, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, oe-kbuild-all, bpf, linux-mm, linux-doc

On Wed, Aug 27, 2025 at 07:39:55PM +0800, Yafang Shao wrote:
> On Wed, Aug 27, 2025 at 10:58 AM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi Yafang,
> >
> > kernel test robot noticed the following build warnings:
> >
> > [auto build test WARNING on akpm-mm/mm-everything]
> >
> > url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-thp-add-support-for-BPF-based-THP-order-selection/20250826-152415
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > patch link:    https://lore.kernel.org/r/20250826071948.2618-2-laoar.shao%40gmail.com
> > patch subject: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
> > config: loongarch-randconfig-r113-20250827 (https://download.01.org/0day-ci/archive/20250827/202508271009.5neOZ0OG-lkp@intel.com/config)
> > compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
> > reproduce: (https://download.01.org/0day-ci/archive/20250827/202508271009.5neOZ0OG-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202508271009.5neOZ0OG-lkp@intel.com/
>
> Thanks for the report.
> It seems this sparse warning can be fixed with the additional change below;
> would you please test it again?
>
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> index 46b3bc96359e..b2f97f9e930d 100644
> --- a/mm/bpf_thp.c
> +++ b/mm/bpf_thp.c
> @@ -5,27 +5,32 @@
>  #include <linux/huge_mm.h>
>  #include <linux/khugepaged.h>
>
> +/**
> + * @get_suggested_order: Get the suggested THP orders for allocation
> + * @mm: mm_struct associated with the THP allocation
> + * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> + *                 When NULL, the decision should be based on @mm (i.e., when
> + *                 triggered from an mm-scope hook rather than a VMA-specific
> + *                 context).
> + *                 Must belong to @mm (guaranteed by the caller).
> + * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> + * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> + * @orders: Bitmask of requested THP orders for this allocation
> + *          - PMD-mapped allocation if PMD_ORDER is set
> + *          - mTHP allocation otherwise
> + *
> + * Rerurn: Bitmask of suggested THP orders for allocation. The highest
> + *         suggested order will not exceed the highest requested order
> + *         in @orders.
> + */
> +typedef int suggested_order_fn_t(struct mm_struct *mm,
> +                                struct vm_area_struct *vma__nullable,
> +                                u64 vma_flags,
> +                                enum tva_type tva_flags,
> +                                int orders);

Hm, you are doing part of my review here as part of the fix :)

I think a respin is in order anyway, so you can tackle this in a future
version.

Not sure the test bot can try out patches though? I've not seen that before
(nice if it, or somebody on the other end, does though! :)

> +
>  struct bpf_thp_ops {
> -       /**
> -        * @get_suggested_order: Get the suggested THP orders for allocation
> -        * @mm: mm_struct associated with the THP allocation
> -        * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> -        *                 When NULL, the decision should be based on @mm (i.e., when
> -        *                 triggered from an mm-scope hook rather than a VMA-specific
> -        *                 context).
> -        *                 Must belong to @mm (guaranteed by the caller).
> -        * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> -        * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> -        * @orders: Bitmask of requested THP orders for this allocation
> -        *          - PMD-mapped allocation if PMD_ORDER is set
> -        *          - mTHP allocation otherwise
> -        *
> -        * Rerurn: Bitmask of suggested THP orders for allocation. The highest
> -        *         suggested order will not exceed the highest requested order
> -        *         in @orders.
> -        */
> -       int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> -                                  u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
> +       suggested_order_fn_t __rcu *get_suggested_order;
>  };
>
>  static struct bpf_thp_ops bpf_thp;
> @@ -34,8 +39,7 @@ static DEFINE_SPINLOCK(thp_ops_lock);
>  int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>                         u64 vma_flags, enum tva_type tva_flags, int orders)
>  {
> -       int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> -                                  u64 vma_flags, enum tva_type tva_flags, int orders);
> +       suggested_order_fn_t *bpf_suggested_order;
>         int suggested_orders = orders;
>
>         /* No BPF program is attached */
> @@ -106,10 +110,12 @@ static int bpf_thp_reg(void *kdata, struct bpf_link *link)
>
>  static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
>  {
> +       suggested_order_fn_t *old_fn;
> +
>         spin_lock(&thp_ops_lock);
>         clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> -       WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> -       rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> +       old_fn = rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> +       WARN_ON_ONCE(!old_fn);
>         spin_unlock(&thp_ops_lock);
>
>         synchronize_rcu();
> @@ -117,8 +123,9 @@ static int bpf_thp_update(void *kdata, struct bpf_link *link)
>
>  static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
>  {
> -       struct bpf_thp_ops *ops = kdata;
> +       suggested_order_fn_t *old_fn, *new_fn;
>         struct bpf_thp_ops *old = old_kdata;
> +       struct bpf_thp_ops *ops = kdata;
>         int ret = 0;
>
>         if (!ops || !old)
> @@ -130,9 +137,10 @@ static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
>                 ret = -ENOENT;
>                 goto out;
>         }
> -       WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> -       rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
> -                           lockdep_is_held(&thp_ops_lock));
> +
> +       new_fn = rcu_dereference(ops->get_suggested_order);
> +       old_fn = rcu_replace_pointer(bpf_thp.get_suggested_order, new_fn, lockdep_is_held(&thp_ops_lock));
> +       WARN_ON_ONCE(!old_fn || !new_fn);
>
>  out:
>         spin_unlock(&thp_ops_lock);
> @@ -159,7 +167,7 @@ static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nul
>  }
>
>  static struct bpf_thp_ops __bpf_thp_ops = {
> -       .get_suggested_order = suggested_order,
> +       .get_suggested_order = (suggested_order_fn_t __rcu *)suggested_order,
>  };
>
>  static struct bpf_struct_ops bpf_bpf_thp_ops = {
>
>
> --
> Regards
>
> Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-26  7:19 ` [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
@ 2025-08-27 15:34   ` Lorenzo Stoakes
  2025-08-27 20:50     ` Shakeel Butt
  2025-08-28  6:57     ` Yafang Shao
  2025-08-27 20:45   ` Shakeel Butt
  1 sibling, 2 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-27 15:34 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc, Michal Hocko, Roman Gushchin, Shakeel Butt

+cc cgroup people, please do include them on this stuff.

BTW I see there is a BPF [STORAGE & CGROUPS] section in MAINTAINERS and
kernel/bpf/cgroup.c etc. anything useful there for us?

On Tue, Aug 26, 2025 at 03:19:40PM +0800, Yafang Shao wrote:
> We will utilize this new kfunc bpf_mm_get_mem_cgroup() to retrieve the
> associated mem_cgroup from the given @mm. The obtained mem_cgroup must
> be released by calling bpf_put_mem_cgroup() as a paired operation.

What locking guarantees do we have that this is all fine?

>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/bpf_thp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-

Also not to be nitty (but I'm going to be anyway :P) but I'm not in love with
the filename here.

So now we have

- khugepaged.c
- huge_memory.c
- bpf_thp.c

Let's maybe call it huge_memory_bpf.c for consistency? And obv as mentioned
before, add it to the MAINTAINERS in the THP section plz.

>  1 file changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> index fbff3b1bb988..b757e8f425fd 100644
> --- a/mm/bpf_thp.c
> +++ b/mm/bpf_thp.c
> @@ -175,10 +175,59 @@ static struct bpf_struct_ops bpf_bpf_thp_ops = {
>  	.name = "bpf_thp_ops",
>  };
>
> +__bpf_kfunc_start_defs();
> +
> +/**
> + * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
> + * @mm: The mm_struct to query
> + *
> + * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
> + *
> + * Return: The associated mem_cgroup on success, or NULL on failure. Note that
> + * this function depends on CONFIG_MEMCG being enabled - it will always return
> + * NULL if CONFIG_MEMCG is not configured.

What kind of locking is assumed here?

Are we protected against mmdrop() clearing out the mm?

> + */
> +__bpf_kfunc struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm)
> +{
> +	return get_mem_cgroup_from_mm(mm);
> +}
> +
> +/**
> + * bpf_put_mem_cgroup - Release a memory cgroup obtained from bpf_mm_get_mem_cgroup()
> + * @memcg: The memory cgroup to release
> + */
> +__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> +{
> +#ifdef CONFIG_MEMCG
> +	if (!memcg)
> +		return;
> +	css_put(&memcg->css);

Feels weird to have an ifdef here but not elsewhere; maybe the whole thing
should be ifdef'd...?

Is there not a put equivalent for get_mem_cgroup_from_mm()? That is a bit weird.

Also, do we now dereference the memcg internals directly? That's pretty gross -
could we not actually implement such a helper?

Is it valid to do this also? Maybe cgroup people can chime in.

> +#endif
> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(bpf_thp_ids)
> +BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
> +BTF_KFUNCS_END(bpf_thp_ids)
> +
> +static const struct btf_kfunc_id_set bpf_thp_set = {
> +	.owner = THIS_MODULE,
> +	.set = &bpf_thp_ids,
> +};
> +
>  static int __init bpf_thp_ops_init(void)
>  {
> -	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> +	int err;
> +
> +	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_thp_set);
> +	if (err) {
> +		pr_err("bpf_thp: Failed to register kfunc sets (%d)\n", err);
> +		return err;
> +	}
>
> +	err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
>  	if (err)
>  		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
>  	return err;

Am again assuming this is legit BPF-wise :) Not my area... yet :>)

> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-26  7:19 ` [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
@ 2025-08-27 15:42   ` Lorenzo Stoakes
  2025-08-27 21:50     ` Andrii Nakryiko
  2025-08-28  6:47     ` Yafang Shao
  0 siblings, 2 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-27 15:42 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Tue, Aug 26, 2025 at 03:19:41PM +0800, Yafang Shao wrote:
> We will utilize this new kfunc bpf_mm_get_task() to retrieve the
> associated task_struct from the given @mm. The obtained task_struct must
> be released by calling bpf_task_release() as a paired operation.

You're basically describing the patch without saying why - yeah, you're
getting a task struct from an mm (only if CONFIG_MEMCG, which you don't
mention here), but not saying for what purpose you intend to use this.

>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> index b757e8f425fd..46b3bc96359e 100644
> --- a/mm/bpf_thp.c
> +++ b/mm/bpf_thp.c
> @@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
>  #endif
>  }
>
> +/**
> + * bpf_mm_get_task - Get the task struct associated with a mm_struct.
> + * @mm: The mm_struct to query
> + *
> + * The obtained task_struct must be released by calling bpf_task_release().

Hmmm so now bpf programs can cause kernel bugs by keeping a reference around?

This feels extremely dodgy, I don't like this at all.

I thought the whole point of BPF was that this kind of thing couldn't possibly
happen?

Or would this be a kernel bug?

If a bpf program can lead to a refcount not being put, this is not
upstreamable surely?

> + *
> + * Return: The associated task_struct on success, or NULL on failure. Note that
> + * this function depends on CONFIG_MEMCG being enabled - it will always return
> + * NULL if CONFIG_MEMCG is not configured.
> + */
> +__bpf_kfunc struct task_struct *bpf_mm_get_task(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_MEMCG
> +	struct task_struct *task;
> +
> +	if (!mm)
> +		return NULL;
> +	rcu_read_lock();
> +	task = rcu_dereference(mm->owner);

> +	if (!task)
> +		goto out;
> +	if (!refcount_inc_not_zero(&task->rcu_users))
> +		goto out;
> +
> +	rcu_read_unlock();
> +	return task;
> +
> +out:
> +	rcu_read_unlock();
> +#endif

This #ifdeffery is horrid, can we please just have separate functions instead of
inside the one? Thanks.

> +	return NULL;

So we can't tell the difference between this failing due to CONFIG_MEMCG
not being set (in which case it will _always_ fail), failing because we
couldn't get a task, or failing because we couldn't get a refcount on the
task.

Maybe this doesn't matter since perhaps we are only using this if
CONFIG_MEMCG but in that case why even expose this if !CONFIG_MEMCG?

> +}
> +
>  __bpf_kfunc_end_defs();
>
>  BTF_KFUNCS_START(bpf_thp_ids)
>  BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
>  BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
> +BTF_ID_FLAGS(func, bpf_mm_get_task, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
>  BTF_KFUNCS_END(bpf_thp_ids)
>
>  static const struct btf_kfunc_id_set bpf_thp_set = {
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted
  2025-08-26  7:19 ` [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted Yafang Shao
@ 2025-08-27 15:45   ` Lorenzo Stoakes
  2025-08-28  6:12     ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-27 15:45 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Tue, Aug 26, 2025 at 03:19:42PM +0800, Yafang Shao wrote:
> Every VMA must have an associated mm_struct, and it is safe to access

Err this isn't true? Pretty sure special VMAs don't have that set.

> outside of RCU. Thus, we can mark it as trusted. With this change, BPF
> helpers can safely access vma->vm_mm to retrieve the associated task
> from the VMA.

On the basis of above don't think this is valid.

>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/bpf/verifier.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index c4f69a9e9af6..984ffbca5cbe 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7154,6 +7154,10 @@ BTF_TYPE_SAFE_TRUSTED(struct file) {
>  	struct inode *f_inode;
>  };
>
> +BTF_TYPE_SAFE_TRUSTED(struct vm_area_struct) {
> +	struct mm_struct *vm_mm;
> +};
> +
>  BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry) {
>  	struct inode *d_inode;
>  };
> @@ -7193,6 +7197,7 @@ static bool type_is_trusted(struct bpf_verifier_env *env,
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task));
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct linux_binprm));
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct file));
> +	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct vm_area_struct));
>
>  	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id, "__safe_trusted");
>  }
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 10/10] MAINTAINERS: add entry for BPF-based THP adjustment
  2025-08-26  7:19 ` [PATCH v6 mm-new 10/10] MAINTAINERS: add entry for BPF-based THP adjustment Yafang Shao
@ 2025-08-27 15:47   ` Lorenzo Stoakes
  2025-08-28  6:08     ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-27 15:47 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Tue, Aug 26, 2025 at 03:19:48PM +0800, Yafang Shao wrote:
> Add maintainership entry for the experimental BPF-driven THP adjustment
> feature. This experimental component may be removed in future releases.
> I will help with maintenance tasks for this feature during its development
> lifecycle.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  MAINTAINERS | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 390829ae9803..71d0f7c58ce8 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16239,6 +16239,7 @@ F:	Documentation/admin-guide/mm/transhuge.rst
>  F:	include/linux/huge_mm.h
>  F:	include/linux/khugepaged.h
>  F:	include/trace/events/huge_memory.h
> +F:	mm/bpf_thp.c
>  F:	mm/huge_memory.c
>  F:	mm/khugepaged.c
>  F:	mm/mm_slot.h
> @@ -16246,6 +16247,15 @@ F:	tools/testing/selftests/mm/khugepaged.c
>  F:	tools/testing/selftests/mm/split_huge_page_test.c
>  F:	tools/testing/selftests/mm/transhuge-stress.c
>
> +MEMORY MANAGEMENT - THP WITH BPF SUPPORT
> +M:	Yafang Shao <laoar.shao@gmail.com>
> +L:	bpf@vger.kernel.org
> +L:	linux-mm@kvack.org
> +S:	Maintained
> +F:	mm/bpf_thp.c
> +F:	tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> +F:	tools/testing/selftests/bpf/progs/test_thp_adjust*
> +

Sorry but I don't agree with a separate section for this.

This should form part of the THP section only, I don't think it's warranted to
do elsewise.

>  MEMORY MANAGEMENT - USERFAULTFD
>  M:	Andrew Morton <akpm@linux-foundation.org>
>  R:	Peter Xu <peterx@redhat.com>
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-26  7:19 ` [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
  2025-08-27 15:34   ` Lorenzo Stoakes
@ 2025-08-27 20:45   ` Shakeel Butt
  2025-08-28  6:58     ` Yafang Shao
  1 sibling, 1 reply; 61+ messages in thread
From: Shakeel Butt @ 2025-08-27 20:45 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, bpf, linux-mm, linux-doc

On Tue, Aug 26, 2025 at 03:19:40PM +0800, Yafang Shao wrote:
> We will utilize this new kfunc bpf_mm_get_mem_cgroup() to retrieve the
> associated mem_cgroup from the given @mm. The obtained mem_cgroup must
> be released by calling bpf_put_mem_cgroup() as a paired operation.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/bpf_thp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> index fbff3b1bb988..b757e8f425fd 100644
> --- a/mm/bpf_thp.c
> +++ b/mm/bpf_thp.c
> @@ -175,10 +175,59 @@ static struct bpf_struct_ops bpf_bpf_thp_ops = {
>  	.name = "bpf_thp_ops",
>  };
>  
> +__bpf_kfunc_start_defs();
> +
> +/**
> + * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
> + * @mm: The mm_struct to query
> + *
> + * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
> + *
> + * Return: The associated mem_cgroup on success, or NULL on failure. Note that
> + * this function depends on CONFIG_MEMCG being enabled - it will always return
> + * NULL if CONFIG_MEMCG is not configured.
> + */
> +__bpf_kfunc struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm)
> +{
> +	return get_mem_cgroup_from_mm(mm);
> +}
> +
> +/**
> + * bpf_put_mem_cgroup - Release a memory cgroup obtained from bpf_mm_get_mem_cgroup()
> + * @memcg: The memory cgroup to release
> + */
> +__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> +{
> +#ifdef CONFIG_MEMCG
> +	if (!memcg)
> +		return;
> +	css_put(&memcg->css);
> +#endif

Just use mem_cgroup_put() here.
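
I.e. something like this - mem_cgroup_put() already handles both NULL and
!CONFIG_MEMCG, so the #ifdef goes away too:

__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
{
	mem_cgroup_put(memcg);
}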

> +}
> +
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(bpf_thp_ids)
> +BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
> +BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
> +BTF_KFUNCS_END(bpf_thp_ids)
> +
> +static const struct btf_kfunc_id_set bpf_thp_set = {
> +	.owner = THIS_MODULE,
> +	.set = &bpf_thp_ids,
> +};
> +
>  static int __init bpf_thp_ops_init(void)
>  {
> -	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> +	int err;
> +
> +	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_thp_set);
> +	if (err) {
> +		pr_err("bpf_thp: Failed to register kfunc sets (%d)\n", err);
> +		return err;
> +	}
>  
> +	err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
>  	if (err)
>  		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
>  	return err;
> -- 
> 2.47.3
> 


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-27 15:34   ` Lorenzo Stoakes
@ 2025-08-27 20:50     ` Shakeel Butt
  2025-08-28 10:40       ` Lorenzo Stoakes
  2025-08-28  6:57     ` Yafang Shao
  1 sibling, 1 reply; 61+ messages in thread
From: Shakeel Butt @ 2025-08-27 20:50 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc, Michal Hocko, Roman Gushchin

On Wed, Aug 27, 2025 at 04:34:48PM +0100, Lorenzo Stoakes wrote:
> > +__bpf_kfunc_start_defs();
> > +
> > +/**
> > + * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
> > + * @mm: The mm_struct to query
> > + *
> > + * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
> > + *
> > + * Return: The associated mem_cgroup on success, or NULL on failure. Note that
> > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > + * NULL if CONFIG_MEMCG is not configured.
> 
> What kind of locking is assumed here?
> 
> Are we protected against mmdrop() clearing out the mm?

No locking is needed. Just the valid mm object or NULL. Usually the
underlying function (get_mem_cgroup_from_mm) is called in page fault
context where the current is holding mm. Here the only requirement is
that mm is valid either through explicit reference or the context.

> 
> > + */
> > +__bpf_kfunc struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm)
> > +{
> > +	return get_mem_cgroup_from_mm(mm);
> > +}
> > +
> > +/**
> > + * bpf_put_mem_cgroup - Release a memory cgroup obtained from bpf_mm_get_mem_cgroup()
> > + * @memcg: The memory cgroup to release
> > + */
> > +__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> > +{
> > +#ifdef CONFIG_MEMCG
> > +	if (!memcg)
> > +		return;
> > +	css_put(&memcg->css);
> 
> Feels weird to have an ifdef here but not elsewhere, maybe the whole thing
> should be ifdef...?
> 
> Is there not a put equivalent for get_mem_cgroup_from_mm()? That is a bit weird.
> 
> Also do we now refrence the memcg global? That's pretty gross, could we not
> actually implement such a helper?
> 
> Is it valid to do this also? Maybe cgroup people can chime in.

There is mem_cgroup_put() which should handle !CONFIG_MEMCG configs.



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-27 15:42   ` Lorenzo Stoakes
@ 2025-08-27 21:50     ` Andrii Nakryiko
  2025-08-28  6:50       ` Yafang Shao
  2025-08-28 10:51       ` Lorenzo Stoakes
  2025-08-28  6:47     ` Yafang Shao
  1 sibling, 2 replies; 61+ messages in thread
From: Andrii Nakryiko @ 2025-08-27 21:50 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc

On Wed, Aug 27, 2025 at 8:48 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:19:41PM +0800, Yafang Shao wrote:
> > We will utilize this new kfunc bpf_mm_get_task() to retrieve the
> > associated task_struct from the given @mm. The obtained task_struct must
> > be released by calling bpf_task_release() as a paired operation.
>
> You're basically describing the patch you're not saying why - yeah you're
> getting a task struct from an mm (only if CONFIG_MEMCG which you don't
> mention here), but not for what purpose you intend to use this?
>
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> >
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > index b757e8f425fd..46b3bc96359e 100644
> > --- a/mm/bpf_thp.c
> > +++ b/mm/bpf_thp.c
> > @@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> >  #endif
> >  }
> >
> > +/**
> > + * bpf_mm_get_task - Get the task struct associated with a mm_struct.
> > + * @mm: The mm_struct to query
> > + *
> > + * The obtained task_struct must be released by calling bpf_task_release().
>
> Hmmm so now bpf programs can cause kernel bugs by keeping a reference around?

The BPF verifier will reject any program that cannot guarantee that
bpf_task_release() will always be called. So there shouldn't be any
problem here.

>
> This feels extremely dodgy, I don't like this at all.
>
> I thought the whole point of BPF was that this kind of thing couldn't possibly
> happen?
>
> Or would this be a kernel bug?
>
> If a bpf program can lead to a refcount not being put, this is not
> upstreamable surely?
>
> > + *
> > + * Return: The associated task_struct on success, or NULL on failure. Note that
> > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > + * NULL if CONFIG_MEMCG is not configured.
> > + */
> > +__bpf_kfunc struct task_struct *bpf_mm_get_task(struct mm_struct *mm)
> > +{
> > +#ifdef CONFIG_MEMCG
> > +     struct task_struct *task;
> > +
> > +     if (!mm)
> > +             return NULL;
> > +     rcu_read_lock();
> > +     task = rcu_dereference(mm->owner);

Question to Yafang, though. Instead of adding a new kfunc just for this,
have you tried marking mm->owner as BTF_TYPE_SAFE_TRUSTED_OR_NULL? If I
understand correctly, that would allow the BPF program to just work with
`mm->owner` (after checking for NULL) directly, and then you can just use
the existing bpf_task_acquire().
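
Roughly, as a hypothetical BPF-side snippet (assuming the trusted-or-null
marking is in place):

struct task_struct *task = mm->owner;

if (task) {
	task = bpf_task_acquire(task);
	if (task) {
		/* ... inspect task here ... */
		bpf_task_release(task);
	}
}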

>
> > +     if (!task)
> > +             goto out;
> > +     if (!refcount_inc_not_zero(&task->rcu_users))
> > +             goto out;

nit: just call bpf_task_acquire(), which will more obviously pair with
suggested bpf_task_release()?

> > +
> > +     rcu_read_unlock();
> > +     return task;
> > +
> > +out:
> > +     rcu_read_unlock();
> > +#endif
>
> This #ifdeffery is horrid, can we please just have separate functions instead of
> inside the one? Thanks.
>
> > +     return NULL;
>
> So we can't tell the difference between this failling due to CONFIG_MEMCG
> not being set (in which case it will _always_ fail) or we couldn't get a
> task or we couldn't get a refcount on the task.
>
> Maybe this doesn't matter since perhaps we are only using this if
> CONFIG_MEMCG but in that case why even expose this if !CONFIG_MEMCG?
>
> > +}
> > +
> >  __bpf_kfunc_end_defs();
> >
> >  BTF_KFUNCS_START(bpf_thp_ids)
> >  BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
> >  BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
> > +BTF_ID_FLAGS(func, bpf_mm_get_task, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
> >  BTF_KFUNCS_END(bpf_thp_ids)
> >
> >  static const struct btf_kfunc_id_set bpf_thp_set = {
> > --
> > 2.47.3
> >


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection
  2025-08-27 13:14 ` Lorenzo Stoakes
@ 2025-08-28  2:58   ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-28  2:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Wed, Aug 27, 2025 at 9:14 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:19:38PM +0800, Yafang Shao wrote:
> > Background
> > ==========
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> >   THP significantly raises overall memory usage, reducing available memory
> >   for workloads.
> >
> > - Latency Spikes
> >   Random latency spikes occur due to frequent memory compaction triggered
> >   by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning is globally configured, making it unsuitable for containerized
> >   environments. When multiple workloads share a host, enabling THP without
> >   per-workload control leads to unpredictable behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or always
> > modes—unless per-workload THP control is implemented.
> >
> > To address this, we propose BPF-based THP policy for flexible adjustment.
> > Additionally, as David mentioned [0], this mechanism can also serve as a
> > policy prototyping tool (test policies via BPF before upstreaming them).
>

Thank you for providing so many comments. I'll take some time to go
through them carefully and will reply afterward.

> I think it's important to highlight here that we are exploring an _experimental_
> implementation.

I will add it.

>
> >
> > Proposed Solution
> > =================
> >
> > As suggested by David [0], we introduce a new BPF interface:
>
> I do agree, to be clear, with this broad approach - that is, to provide the
> minimum information that a reasonable decision can be made upon and to keep
> things as simple as we can.
>
> As per the THP cabal (I think? :) the general consensus was in line with
> this.

My testing in both test and production environments indicates that the
following parameters are essential:
- mm_struct (associated with the THP allocation)
- vma_flags (VM_HUGEPAGE, VM_NOHUGEPAGE, or N/A)
- tva_type
- The requested THP orders bitmask

I will retain these four and remove @vma__nullable.

>
>
> >
> > /**
> >  * @get_suggested_order: Get the suggested THP orders for allocation
> >  * @mm: mm_struct associated with the THP allocation
> >  * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> >  *                 When NULL, the decision should be based on @mm (i.e., when
> >  *                 triggered from an mm-scope hook rather than a VMA-specific
> >  *                 context).
>
> I'm a little wary of handing a VMA to BPF - under what locking would it
> be provided?

We cannot arbitrarily use members of the struct vm_area_struct because
they are untrusted pointers. The only trusted pointer is vma->vm_mm,
which can be accessed without holding any additional locks. For the
VMA itself, the caller at the callsite has already taken the necessary
locks, so we do not need to acquire them again.

My testing shows the @vma parameter is not needed. I will remove it in
the next update.

>
> >  *                 Must belong to @mm (guaranteed by the caller).
> >  * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>
> Hmm this one is also a bit odd - why would these flags differ? Note that I will
> be changing the VMA flags to a bitmap relatively soon which may be larger than
> the system word size.
>
> So 'handing around all the flags' is something we probably want to avoid.

Good suggestion. Since we specifically need to identify VM_HUGEPAGE or
VM_NOHUGEPAGE, I will add a new enum for clarity, bpf_thp_vma_type:

+enum bpf_thp_vma_type {
+       BPF_VM_NONE = 0,
+       BPF_VM_HUGEPAGE,        /* VM_HUGEPAGE */
+       BPF_VM_NOHUGEPAGE,      /* VM_NOHUGEPAGE */
+};

The enum can be extended in the future to support file-backed THP by
adding new types.
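
The kernel side of the hook would then map the VMA flags to this enum,
e.g. (helper name illustrative):

static enum bpf_thp_vma_type bpf_thp_vma_type(vm_flags_t vm_flags)
{
	if (vm_flags & VM_NOHUGEPAGE)
		return BPF_VM_NOHUGEPAGE;
	if (vm_flags & VM_HUGEPAGE)
		return BPF_VM_HUGEPAGE;
	return BPF_VM_NONE;
}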

>
> For the f_op->mmap_prepare stuff I provided an abstraction
>
> >  * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> >  * @orders: Bitmask of requested THP orders for this allocation
> >  *          - PMD-mapped allocation if PMD_ORDER is set
> >  *          - mTHP allocation otherwise
> >  *
> >  * Rerurn: Bitmask of suggested THP orders for allocation. The highest
>
> Obv. a cover letter thing but typo here :P rerurn -> return.

will change it.

>
> >  *         suggested order will not exceed the highest requested order
> >  *         in @orders.
>
> In what sense are they 'suggested'? Is this a product of sysfs settings,
> or something else? I think this needs to be clearer.

The order is suggested by a BPF program. I will clarify it in the next version.

>
> >  */
> >  int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> >                             u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>
> Also here in what sense is this suggested? :)

Agreed. I'll rename it to bpf_hook_thp_get_order() as suggested for clarity.

>
> >
> > This interface:
> > - Supports both use cases (per-workload tuning + policy prototyping).
> > - Can be extended with BPF helpers (e.g., for memory pressure awareness).
>
> Hm how would extensions like this work?

To optimize THP allocation, we should consult the PSI data beforehand. If
memory pressure is already high (indicating difficulty in allocating
high-order pages), the system should default to allocating 4K pages
instead. This could be implemented by checking the PSI data of the
relevant cgroup:

  struct cgroup *cgrp = task_dfl_cgroup(mm->owner);
  struct psi_group *psi = cgroup_psi(cgrp);  // or psi_system
  u64 psi_data = psi->total[PSI_AVGS][PSI_MEM];

The allocation strategy would then branch based on the value of
psi_data. This may require new BPF helpers to access PSI data
efficiently.
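
For example, a policy could then branch like this (threshold purely
illustrative):

  /* Fall back to 4K pages when memory pressure is already high. */
  if (psi_data > MEM_PRESSURE_THRESHOLD)
          return 0;     /* suggest no THP orders */
  return orders;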

>
> >
> > This is an experimental feature. To use it, you must enable
> > CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.
>
> Yes! Thanks. I am glad we are putting this behind a config flag.
>
> >
> > Warning:
> > - The interface may change
> > - Behavior may differ in future kernel versions
> > - We might remove it in the future
> >
> >
> > Selftests
> > =========
> >
> > BPF selftests
> > -------------
> >
> > Patch #5: Implements a basic BPF THP policy that restricts THP allocation
> >           via khugepaged to tasks within a specified memory cgroup.
> > Patch #6: Contains test cases validating the khugepaged fork behavior.
> > Patch #7: Provides tests for dynamic BPF program updates and replacement.
> > Patch #8: Includes negative tests for invalid BPF helper usage, verifying
> >           proper verification by the BPF verifier.
> >
> > Currently, several dependency patches reside in mm-new but haven't been
> > merged into bpf-next:
> >   mm: add bitmap mm->flags field
> >   mm/huge_memory: convert "tva_flags" to "enum tva_type"
> >   mm: convert core mm to mm_flags_*() accessors
> >
> > To enable BPF CI testing, these dependencies were manually applied to
> > bpf-next [1]. All selftests in this series pass successfully. The observed
> > CI failures are unrelated to these changes.
>
> Cool, glad at least my mm changes were ok :)
>
> >
> > Performance Evaluation
> > ----------------------
> >
> > As suggested by Usama [2], performance impact was measured given the page
> > fault handler modifications. The standard `perf bench mem memset` benchmark
> > was employed to assess page fault performance.
> >
> > Testing was conducted on an AMD EPYC 7W83 64-Core Processor (single NUMA
> > node). Due to variance between individual test runs, a script executed
> > 10000 iterations to calculate meaningful averages and standard deviations.
> >
> > The results across three configurations show negligible performance impact:
> > - Baseline (without this patch series)
> > - With patch series but no BPF program attached
> > - With patch series and BPF program attached
> >
> > The result are as follows,
> >
> >   Number of runs: 10,000
> >   Average throughput: 40-41 GB/sec
> >   Standard deviation: 7-8 GB/sec
>
> You're not giving data comparing the 3? Could you do so? Thanks.

I tested all three cases. The results from the three test cases were
similar, so I aggregated the data.

>
> >
> > Production verification
> > -----------------------
> >
> > We have successfully deployed a variant of this approach across numerous
> > Kubernetes production servers. The implementation enables THP for specific
> > workloads (such as applications utilizing ZGC [3]) while disabling it for
> > others. This selective deployment has operated flawlessly, with no
> > regression reports to date.
> >
> > For ZGC-based applications, our verification demonstrates that shmem THP
> > delivers significant improvements:
> > - Reduced CPU utilization
> > - Lower average latencies
>
> Obviously it's _really key_ to point out that this feature is intended to
> be _absolutely_ ephemeral - we may or may not implement something like this
> - it's really about both exploring how such an interface might look and
> also helping to determine how an 'automagic' future might look.

Our users can benefit from this feature, which is why we have already
deployed it on our production servers. We are now extending it to more
workloads, such as RDMA applications, where THP provides significant
performance gains. Given the complexity of our production environment,
we have found that manual control is a necessary practice. I am
presenting this case solely to demonstrate the feature's stability and
that it does not introduce regressions. However, I understand this use
case is not recommended by the maintainers and will clarify this in
the next version.

>
> >
> > Future work
> > ===========
> >
> > Based on our validation with production workloads, we observed mixed
> > results with XFS large folios (also known as File THP):
> >
> > - Performance Benefits
> >   Some workloads demonstrated significant improvements with XFS large
> >   folios enabled
> > - Performance Regression
> >   Some workloads experienced degradation when using XFS large folios
> >
> > These results demonstrate that File THP, similar to anonymous THP, requires
> > a more granular approach instead of a uniform implementation.
> >
> > We will extend the BPF-based order selection mechanism to support File THP
> > allocation policies.
> >
> > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> > Link: https://github.com/kernel-patches/bpf/pull/9561 [1]
> > Link: https://lwn.net/ml/all/a24d632d-4b11-4c88-9ed0-26fa12a0fce4@gmail.com/ [2]
> > Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTransparentHugePagesOnLinux [3]
> >
> > Changes:
> > =======
> >
> > RFC v5-> v6:
> > - Code improvement around the RCU usage (Usama)
> > - Add selftests for khugepaged fork (Usama)
> > - Add performance data for page fault (Usama)
> > - Remove the RFC tag
> >
>
> Sorry I haven't been involved in the RFC reviews, always intended to but
> workload etc.
>
> Will be looking through this series as very interested in exploring this
> approach.

Thanks a lot for your reviews.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-27 15:03   ` Lorenzo Stoakes
@ 2025-08-28  5:54     ` Yafang Shao
  2025-08-28 10:50       ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-28  5:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Wed, Aug 27, 2025 at 11:03 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:19:39PM +0800, Yafang Shao wrote:
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> >   For example, workloads running in specific containers or cgroups.
> > - Allocation context
> >   Whether the allocation occurs during a page fault, khugepaged, or other
> >   paths.
> > - System memory pressure
> >   (May require new BPF helpers to accurately assess memory pressure.)
> >
> > Key Details:
> > - Only one BPF program can be attached at a time, but it can be updated
> >   dynamically to adjust the policy.
> > - Supports automatic mTHP order selection and per-workload THP policies.
> > - Only functional when THP is set to madvise or always.
> >
> > It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> > This feature is unstable and may evolve in future kernel versions.
> >
> > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/huge_mm.h    |  15 +++
> >  include/linux/khugepaged.h |  12 ++-
> >  mm/Kconfig                 |  12 +++
> >  mm/Makefile                |   1 +
> >  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
>
> Please add new files to MAINTAINERS as you add them.

will do it.

>
> >  mm/huge_memory.c           |  10 ++
> >  mm/khugepaged.c            |  26 +++++-
> >  mm/memory.c                |  18 +++-
> >  8 files changed, 273 insertions(+), 7 deletions(-)
> >  create mode 100644 mm/bpf_thp.c
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 1ac0d06fb3c1..f0c91d7bd267 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -6,6 +6,8 @@
> >
> >  #include <linux/fs.h> /* only for vma_is_dax() */
> >  #include <linux/kobject.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/mm.h>
>
> Hm this is a bit weird as mm.h includes huge_mm... I guess it will be handled by
> header defines but still.

Some refactoring is needed for these two header files, but we can
handle it separately later.

>
> >
> >  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> >  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
> >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> >  };
> >
> >  struct kobject;
> > @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
> >                       (1<<TRANSPARENT_HUGEPAGE_FLAG);
> >  }
> >
> > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                     u64 vma_flags, enum tva_type tva_flags, int orders);
>
> Not a massive fan of this naming to be honest. I think it should explicitly
> reference bpf, e.g. bpf_hook_thp_get_order() or something.

will change it to bpf_hook_thp_get_orders().

>
> Right now this is super unclear as to what it's for.
>
> Also wrt vma_flags - this type is wrong :) it's vm_flags_t and going to change
> to a bitmap of unlimiiteeed size soon. So probs best not to pass around as value
> type either.

As replied in another thread. I will change it.

>
> But unclear us to purpose as mentioned elsewhere.
>
> And also get_suggested_order() should be get_suggested_orderS() no? As you
> seem later in the code to be referencing a bitfield?

Right, it should be bpf_hook_thp_get_orderS().

>
> Also will mm ever != vma->vm_mm?

No, it can't. The caller guarantees that @vma belongs to @mm.

>
> Are we hacking this for the sake of overloading what this does?

The @vma is actually unneeded. I will remove it.

>
> Also if we're returning a bitmask of orders which you seem to be (not sure I
> like that tbh - I feel like we should simply provide one order but open for
> discussion) - shouldn't it return an unsigned long?

We are indifferent to whether a single order or a bitmask is returned,
as we only use order-0 and order-9. We have no use cases for
middle-order pages, though this feature might be useful for other
architectures or for some special use cases.

>
> > +#else
> > +static inline int
> > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                 u64 vma_flags, enum tva_type tva_flags, int orders)
> > +{
> > +     return orders;
> > +}
> > +#endif
> > +
> >  static inline int highest_order(unsigned long orders)
> >  {
> >       return fls_long(orders) - 1;
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index eb1946a70cff..d81c1228a21f 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -4,6 +4,8 @@
> >
> >  #include <linux/mm.h>
> >
> > +#include <linux/huge_mm.h>
> > +
>
> Hm this is iffy too. There's probably a reason we didn't include this before,
> the headers can be so so fragile. Let's be cautious...

I will check.

>
> >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  extern struct attribute_group khugepaged_attr_group;
> > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >
> >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> > -     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> > +     /*
> > +      * THP allocation policy can be dynamically modified via BPF. Even if a
> > +      * task was allowed to allocate THPs, BPF can decide whether its forked
> > +      * child can allocate THPs.
> > +      *
> > +      * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > +      */
> > +     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> > +             get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
>
> Hmmm so there seems to be some kind of additional functionality you're providing
> here kinda quietly, which is to allow the exact same interface to determine
> whether we kick off khugepaged or not.
>
> Don't love that, I think we should be hugely specific about that.
>
> This bpf interface should literally be 'ok we're deciding what order we
> want'. It feels like a bit of a gross overloading?

This makes sense. I have no objection to reverting to returning a single order.

>
> >               __khugepaged_enter(mm);
> >  }
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 4108bcd96784..d10089e3f181 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
> >
> >         EXPERIMENTAL because the impact of some changes is still unclear.
> >
> > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > +
> > +     help
> > +       Enable dynamic THP order selection using BPF programs. This
> > +       experimental feature allows custom BPF logic to determine optimal
> > +       transparent hugepage allocation sizes at runtime.
> > +
> > +       Warning: This feature is unstable and may change in future kernel
> > +       versions.
>
> Thanks! This is important to document. Absolute nitty nit: can you capitalise
> 'WARNING'? Thanks!

will do it.

>
> > +
> >  endif # TRANSPARENT_HUGEPAGE
> >
> >  # simple helper to make the code a bit easier to read
> > diff --git a/mm/Makefile b/mm/Makefile
> > index ef54aa615d9d..cb55d1509be1 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> >  obj-$(CONFIG_NUMA) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > new file mode 100644
> > index 000000000000..fbff3b1bb988
> > --- /dev/null
> > +++ b/mm/bpf_thp.c
>
> As mentioned before, please update MAINTAINERS for new files. I went to great +
> painful lengths to get everything listed there so let's keep it that way please
> :P

will do it.

>
> > @@ -0,0 +1,186 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +struct bpf_thp_ops {
> > +     /**
> > +      * @get_suggested_order: Get the suggested THP orders for allocation
> > +      * @mm: mm_struct associated with the THP allocation
> > +      * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> > +      *                 When NULL, the decision should be based on @mm (i.e., when
> > +      *                 triggered from an mm-scope hook rather than a VMA-specific
> > +      *                 context).
> > +      *                 Must belong to @mm (guaranteed by the caller).
> > +      * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> > +      * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> > +      * @orders: Bitmask of requested THP orders for this allocation
> > +      *          - PMD-mapped allocation if PMD_ORDER is set
> > +      *          - mTHP allocation otherwise
> > +      *
> > +      * Rerurn: Bitmask of suggested THP orders for allocation. The highest
> > +      *         suggested order will not exceed the highest requested order
> > +      *         in @orders.
> > +      */
> > +     int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                                u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>
> I feel like we should be declaring this function pointer type somewhere else as
> we're now duplicating this in two places.

Agreed. I have already done so to fix the sparse warning.
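
Something along these lines, so the signature lives in a single place
(the typedef name is illustrative):

/* Shared typedef for the hook, instead of repeating the signature in
 * bpf_thp_ops, the CFI stub, and the caller.
 */
typedef int (*bpf_thp_order_fn_t)(struct mm_struct *mm,
                                  struct vm_area_struct *vma__nullable,
                                  u64 vma_flags, enum tva_type tva_flags,
                                  int orders);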

>
> > +};
> > +
> > +static struct bpf_thp_ops bpf_thp;
> > +static DEFINE_SPINLOCK(thp_ops_lock);
> > +
> > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                     u64 vma_flags, enum tva_type tva_flags, int orders)
>
> surely tva_flag? As this is an enum value?

will change it to tva_type instead.

>
> > +{
> > +     int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                                u64 vma_flags, enum tva_type tva_flags, int orders);
>
> This type for vma flags is totally incorrect. vm_flags_t. And that's going to
> change soon to an opaque type.
>
> Also right now it's actually an unsigned long.
>
> I really really do not like that we're providing extra, unexplained VMA flags
> for some reason. I may be missing something :) so happy to hear why this is
> necessary.
>
> However in future we really shouldn't be passing something like this.

will change it as replied in another thread.

>
> Also - now a third duplication of the same function pointer :) can we do better
> than this? At least typedef it.
>
> > +     int suggested_orders = orders;
> > +
> > +     /* No BPF program is attached */
> > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                   &transparent_hugepage_flags))
> > +             return suggested_orders;
>
> This is atomic ofc, but are we concerned about races, or I guess you expect only
> the first attached bpf program to work with it I suppose.

It protects against races with unreg and update operations.

>
> > +
> > +     rcu_read_lock();
>
> Is this sufficient? Anything stopping the mm or VMA going away here?

This RCU lock is not for protecting the mm or VMA structures
themselves, but for protecting the update of the function pointer.
Arbitrary access to pointers within the mm_struct or vm_area_struct is
prohibited, as they are guarded by the BPF verifier.

>
> > +     bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> > +     if (!bpf_suggested_order)
> > +             goto out;
> > +
> > +     suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
>
> OK so now it's suggested order_S but we're invoking suggested order :) whaaatt?
> :)

will change it.

>
> > +     if (highest_order(suggested_orders) > highest_order(orders))
> > +             suggested_orders = orders;
>
> Hmmm so the semantics are - whichever is the highest order wins?

The maximum requested order is determined by the callsite. For example:
- PMD-mapped THP uses PMD_ORDER
- mTHP uses orders up to (PMD_ORDER - 1)

We must respect this upper bound to avoid undefined behavior. So the
highest suggested order can't exceed the highest requested order.

>
> I thought the idea was we'd hand control over to bpf if provided in effect?
>
> Definitely worth going over these semantics in the cover letter (and do forgive
> me if you have and I've missed! :)

It is already in the cover letter:

 * Return: Bitmask of suggested THP orders for allocation. The highest
 *         suggested order will not exceed the highest requested order
 *         in @orders.


>
> > +
> > +out:
> > +     rcu_read_unlock();
> > +     return suggested_orders;
> > +}
> > +
> > +static bool bpf_thp_ops_is_valid_access(int off, int size,
> > +                                     enum bpf_access_type type,
> > +                                     const struct bpf_prog *prog,
> > +                                     struct bpf_insn_access_aux *info)
> > +{
> > +     return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> > +}
> > +
> > +static const struct bpf_func_proto *
> > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +     return bpf_base_func_proto(func_id, prog);
> > +}
> > +
> > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> > +     .get_func_proto = bpf_thp_get_func_proto,
> > +     .is_valid_access = bpf_thp_ops_is_valid_access,
> > +};
> > +
> > +static int bpf_thp_init(struct btf *btf)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int bpf_thp_init_member(const struct btf_type *t,
> > +                            const struct btf_member *member,
> > +                            void *kdata, const void *udata)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> > +{
> > +     struct bpf_thp_ops *ops = kdata;
> > +
> > +     spin_lock(&thp_ops_lock);
> > +     if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                          &transparent_hugepage_flags)) {
> > +             spin_unlock(&thp_ops_lock);
> > +             return -EBUSY;
> > +     }
> > +     WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
> > +     rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order);
> > +     spin_unlock(&thp_ops_lock);
> > +     return 0;
> > +}
> > +
> > +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> > +{
> > +     spin_lock(&thp_ops_lock);
> > +     clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> > +     WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> > +     rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> > +     spin_unlock(&thp_ops_lock);
> > +
> > +     synchronize_rcu();
> > +}
>
> I am a total beginner with BPF implementations so don't feel like I can say much
> intelligent about the above. But presumably fairly standard fare BPF-wise?

This implementation is necessary to support updating an attached BPF
program in place.
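
From userspace, the update path is exercised roughly like this (a
sketch using libbpf's struct_ops link API; the skeleton names are
illustrative):

        struct bpf_link *link;
        int err;

        /* Attach the initial policy... */
        link = bpf_map__attach_struct_ops(old_skel->maps.thp_ops);

        /* ...then later swap in a new policy without detaching */
        err = bpf_link__update_map(link, new_skel->maps.thp_ops);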

>
> Will perhaps try to dig deeper on another iteration :) as it's interesting to me.
>
> > +
> > +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> > +{
> > +     struct bpf_thp_ops *ops = kdata;
> > +     struct bpf_thp_ops *old = old_kdata;
> > +     int ret = 0;
> > +
> > +     if (!ops || !old)
> > +             return -EINVAL;
> > +
> > +     spin_lock(&thp_ops_lock);
> > +     /* The prog has already been removed. */
> > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
> > +             ret = -ENOENT;
> > +             goto out;
> > +     }
>
> OK so we gate things on this flag and it's global, got it.
>
> I see this is a hook, and I guess RCU-all-the-things is what BPF does which
> makes tonnes of sense.
>
> > +     WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> > +     rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
> > +                         lockdep_is_held(&thp_ops_lock));
> > +
> > +out:
> > +     spin_unlock(&thp_ops_lock);
> > +     if (!ret)
> > +             synchronize_rcu();
> > +     return ret;
> > +}
> > +
> > +static int bpf_thp_validate(void *kdata)
> > +{
> > +     struct bpf_thp_ops *ops = kdata;
> > +
> > +     if (!ops->get_suggested_order) {
> > +             pr_err("bpf_thp: required ops isn't implemented\n");
> > +             return -EINVAL;
> > +     }
> > +     return 0;
> > +}
> > +
> > +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                        u64 vma_flags, enum tva_type vm_flags, int orders)
> > +{
> > +     return orders;
> > +}
> > +
> > +static struct bpf_thp_ops __bpf_thp_ops = {
> > +     .get_suggested_order = suggested_order,
> > +};
>
> Can you explain to me what this stub stuff is for? This is more 'BPF impl 101'
> stuff sorry :)

It is a CFI stub. The cfi_stubs in BPF struct_ops are intermediary
functions that prevent the kernel from making direct, CFI-violating
jumps to BPF code. A newly attached BPF program runs via this stub.

>
> > +
> > +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> > +     .verifier_ops = &thp_bpf_verifier_ops,
> > +     .init = bpf_thp_init,
> > +     .init_member = bpf_thp_init_member,
> > +     .reg = bpf_thp_reg,
> > +     .unreg = bpf_thp_unreg,
> > +     .update = bpf_thp_update,
> > +     .validate = bpf_thp_validate,
> > +     .cfi_stubs = &__bpf_thp_ops,
> > +     .owner = THIS_MODULE,
> > +     .name = "bpf_thp_ops",
> > +};
> > +
> > +static int __init bpf_thp_ops_init(void)
> > +{
> > +     int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> > +
> > +     if (err)
> > +             pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> > +     return err;
> > +}
> > +late_initcall(bpf_thp_ops_init);
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d89992b65acc..bd8f8f34ab3c 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >               return ret;
> >       khugepaged_enter_vma(vma, vma->vm_flags);
> >
> > +     /*
> > +      * This check must occur after khugepaged_enter_vma() because:
> > +      * 1. We may permit THP allocation via khugepaged
> > +      * 2. While simultaneously disallowing THP allocation
> > +      *    during page fault handling
> > +      */
> > +     if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
> > +                             BIT(PMD_ORDER))
>
> Hmmm so you return a bitmask of orders, but then you only allow this fault if
> the only order provided is PMD order? That seems strange. Can you explain?

This is in do_huge_pmd_anonymous_page(), which can only handle a PMD
order; anything else might result in unexpected behavior.

>
> > +             return VM_FAULT_FALLBACK;
>
> It'd be good to have a helper function for this like:
>
>         if (!bpf_hook_allow_pmd_order(vma, tva_flag))
>                 return VM_FAULT_FALLBACK;
>
> And implemented like maybe:
>
> static bool bpf_hook_allow_pmd_order(struct vm_area_struct *vma, enum tva_type tva_flag)
> {
>         int orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, tva_flag,
>                         BIT(PMD_ORDER));
>
>         return orders & BIT(PMD_ORDER);
> }
>
> It's good the tva flag gives context though.

Thanks for the suggestion.
will change it.

>
> > +
> >       if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> >                       !mm_forbids_zeropage(vma->vm_mm) &&
> >                       transparent_hugepage_use_zero_page()) {
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index d3d4f116e14b..935583626db6 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> >  {
> >       if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> >           hugepage_pmd_enabled()) {
> > -             if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> > +             if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) &&
> > +                 get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED,
> > +                                     BIT(PMD_ORDER)))
>
> I don't know why we aren't working the bpf hook into thp_vma_allowable_order()?

Actually it can be added into thp_vma_allowable_order().  I will change it.
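
A rough sketch of what that integration could look like (not the
final implementation; names follow the discussion above):

static inline unsigned long
thp_vma_allowable_orders(struct vm_area_struct *vma, vm_flags_t vm_flags,
                         enum tva_type type, unsigned long orders)
{
        /* Let the attached BPF program filter the requested orders first */
        orders = bpf_hook_thp_get_orders(vma->vm_mm, vma, vm_flags, type,
                                         orders);
        if (!orders)
                return 0;
        return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
}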

>
> Also a helper would work here.
>
> >                       __khugepaged_enter(vma->vm_mm);
> >       }
> >  }
> > @@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> >               return SCAN_ADDRESS_RANGE;
> >       if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
> >               return SCAN_VMA_CHECK;
> > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER)))
> > +             return SCAN_VMA_CHECK;
>
>
>
> >       /*
> >        * Anon VMA expected, the address may be unmapped then
> >        * remapped to file after khugepaged reaquired the mmap_lock.
> > @@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> >               /* khugepaged_mm_lock actually not necessary for the below */
> >               mm_slot_free(mm_slot_cache, mm_slot);
> >               mmdrop(mm);
> > +     } else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) {
> > +             hash_del(&slot->hash);
> > +             list_del(&slot->mm_node);
> > +             mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> > +             mm_slot_free(mm_slot_cache, mm_slot);
> >       }
> >  }
> >
> > @@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> >               return SCAN_VMA_CHECK;
> >
> > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> > +                              BIT(PMD_ORDER)))
>
> Again, can we please not duplicate thp_vma_allowable_order() logic?
>
> The THP code is horrible enough, but now we have to remember to also do the bpf
> check?

makes sense.

>
> > +             return SCAN_VMA_CHECK;
> >       /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> >       if (userfaultfd_wp(vma))
> >               return SCAN_PTE_UFFD_WP;
> > @@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >        * the next mm on the list.
> >        */
> >       vma = NULL;
> > +
> > +     /* If this mm is not suitable for the scan list, we should remove it. */
> > +     if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> > +             goto breakouterloop_mmap_lock;
>
> OK again I'm really not loving this NULL, 0, -1 stuff. What is this supposed to
> mean? The idea here is we have a hook for 'trying to determine THP order' and
> now it's overloaded it seems in multiple ways?
>
> I may be missing context here.
>
> I'm also a bit perplexed by the comment as to what is intended here.

Using a BPF-based approach for THP adjustment allows us to dynamically
enable or disable THP for running applications without causing any
disruption, which is particularly valuable in production environments.
The logic here removes an mm from the khugepaged scan list once the
BPF policy no longer permits THP for it, achieving exactly that.


>
> >       if (unlikely(!mmap_read_trylock(mm)))
> >               goto breakouterloop_mmap_lock;
> >
> > @@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >                       progress++;
> >                       break;
> >               }
> > -             if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> > +             if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) ||
> > +                 !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED,
> > +                                      BIT(PMD_ORDER))) {
>
> Same various comments from above.

will change it.

>
> >  skip:
> >                       progress++;
> >                       continue;
> > @@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> >               return -EINVAL;
> >
> > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> > +                              BIT(PMD_ORDER)))
> > +             return -EINVAL;
> > +
>
> Same various comments from above.

will change it.

>
> >       cc = kmalloc(sizeof(*cc), GFP_KERNEL);
> >       if (!cc)
> >               return -ENOMEM;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index d9de6c056179..0178857aa058 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> >  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  {
> >       struct vm_area_struct *vma = vmf->vma;
> > +     int order, suggested_orders;
> >       unsigned long orders;
> >       struct folio *folio;
> >       unsigned long addr;
> > @@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >       spinlock_t *ptl;
> >       pte_t *pte;
> >       gfp_t gfp;
> > -     int order;
> >
> >       /*
> >        * If uffd is active for the vma we need per-page fault fidelity to
> > @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >       if (!zswap_never_enabled())
> >               goto fallback;
> >
> > +     suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> > +                                            TVA_PAGEFAULT,
> > +                                            BIT(PMD_ORDER) - 1);
> > +     if (!suggested_orders)
> > +             goto fallback;
>
> Wait, but below we have a bunch of fallbacks, now BPF overrides everything?

When allocating high-order pages is not feasible, such as during
periods of high memory pressure, the system should immediately fall
back to using 4 KB pages.

>
> I know I'm repeating myself :P but can we just please put this into
> thp_vma_allowable_orders(), it's massively gross to just duplicate this check
> _everywhere_ with subtle differences.

will change it.

>
> >       entry = pte_to_swp_entry(vmf->orig_pte);
> >       /*
> >        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >        * and suitable for swapping THP.
> >        */
> >       orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> > -                                       BIT(PMD_ORDER) - 1);
> > +                                       suggested_orders);
> >       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >       orders = thp_swap_suitable_orders(swp_offset(entry),
> >                                         vmf->address, orders);
> > @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >  {
> >       struct vm_area_struct *vma = vmf->vma;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +     int order, suggested_orders;
> >       unsigned long orders;
> >       struct folio *folio;
> >       unsigned long addr;
> >       pte_t *pte;
> >       gfp_t gfp;
> > -     int order;
> >
> >       /*
> >        * If uffd is active for the vma we need per-page fault fidelity to
> > @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >       if (unlikely(userfaultfd_armed(vma)))
> >               goto fallback;
> >
> > +     suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> > +                                            TVA_PAGEFAULT,
> > +                                            BIT(PMD_ORDER) - 1);
> > +     if (!suggested_orders)
> > +             goto fallback;
>
> Same comment as above.

will change it.


Thanks a lot for your comments.


--
Regards

Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 10/10] MAINTAINERS: add entry for BPF-based THP adjustment
  2025-08-27 15:47   ` Lorenzo Stoakes
@ 2025-08-28  6:08     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-28  6:08 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Wed, Aug 27, 2025 at 11:47 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:19:48PM +0800, Yafang Shao wrote:
> > Add maintainership entry for the experimental BPF-driven THP adjustment
> > feature. This experimental component may be removed in future releases.
> > I will help with maintenance tasks for this feature during its development
> > lifecycle.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  MAINTAINERS | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 390829ae9803..71d0f7c58ce8 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -16239,6 +16239,7 @@ F:    Documentation/admin-guide/mm/transhuge.rst
> >  F:   include/linux/huge_mm.h
> >  F:   include/linux/khugepaged.h
> >  F:   include/trace/events/huge_memory.h
> > +F:   mm/bpf_thp.c
> >  F:   mm/huge_memory.c
> >  F:   mm/khugepaged.c
> >  F:   mm/mm_slot.h
> > @@ -16246,6 +16247,15 @@ F:   tools/testing/selftests/mm/khugepaged.c
> >  F:   tools/testing/selftests/mm/split_huge_page_test.c
> >  F:   tools/testing/selftests/mm/transhuge-stress.c
> >
> > +MEMORY MANAGEMENT - THP WITH BPF SUPPORT
> > +M:   Yafang Shao <laoar.shao@gmail.com>
> > +L:   bpf@vger.kernel.org
> > +L:   linux-mm@kvack.org
> > +S:   Maintained
> > +F:   mm/bpf_thp.c
> > +F:   tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> > +F:   tools/testing/selftests/bpf/progs/test_thp_adjust*
> > +
>
> Sorry but I don't agree with a separate section for this.
>
> This should form part of the THP section only, I don't think it's warranted to
> do elsewise.

I initially added it as a separate entry to ensure that
bpf@vger.kernel.org would be CCed. However, I discovered that any file
whose name contains “bpf” will automatically CC that list. Therefore,
it’s fine to include this under the THP entry instead.


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted
  2025-08-27 15:45   ` Lorenzo Stoakes
@ 2025-08-28  6:12     ` Yafang Shao
  2025-08-28 11:11       ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-28  6:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Wed, Aug 27, 2025 at 11:46 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:19:42PM +0800, Yafang Shao wrote:
> > Every VMA must have an associated mm_struct, and it is safe to access
>
> Err this isn't true? Pretty sure special VMAs don't have that set.

I’m not aware of any VMA that doesn’t belong to an mm_struct. If there
is such a case, it would be helpful if you could point it out. In any
case, I’ll remove the VMA-related code in the next version since it’s
unnecessary.

>
> > outside of RCU. Thus, we can mark it as trusted. With this change, BPF
> > helpers can safely access vma->vm_mm to retrieve the associated task
> > from the VMA.
>
> On the basis of above don't think this is valid.
>

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-27 15:42   ` Lorenzo Stoakes
  2025-08-27 21:50     ` Andrii Nakryiko
@ 2025-08-28  6:47     ` Yafang Shao
  2025-08-29 10:43       ` Lorenzo Stoakes
  1 sibling, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-28  6:47 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Wed, Aug 27, 2025 at 11:42 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:19:41PM +0800, Yafang Shao wrote:
> > We will utilize this new kfunc bpf_mm_get_task() to retrieve the
> > associated task_struct from the given @mm. The obtained task_struct must
> > be released by calling bpf_task_release() as a paired operation.
>
> You're basically describing the patch you're not saying why - yeah you're
> getting a task struct from an mm (only if CONFIG_MEMCG which you don't
> mention here), but not for what purpose you intend to use this?

For example, we could retrieve task->comm or other attributes and make
decisions based on that information. I’ll provide a clearer
description in the next revision.
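
A sketch of what such a check might look like inside the hook,
assuming the bpf_mm_get_task() kfunc from this series (the workload
name is illustrative):

extern struct task_struct *bpf_mm_get_task(struct mm_struct *mm) __ksym;
extern void bpf_task_release(struct task_struct *p) __ksym;

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggested_order, struct mm_struct *mm,
             struct vm_area_struct *vma, u64 vma_flags,
             enum tva_type tva_flags, int orders)
{
        struct task_struct *task = bpf_mm_get_task(mm);
        char comm[16];  /* TASK_COMM_LEN */
        int suggested = 0;

        if (!task)
                return 0;
        bpf_probe_read_kernel_str(comm, sizeof(comm), task->comm);
        /* e.g. only suggest THP for a specific workload */
        if (!bpf_strncmp(comm, 5, "java"))
                suggested = orders;
        bpf_task_release(task);
        return suggested;
}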

>
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> >
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > index b757e8f425fd..46b3bc96359e 100644
> > --- a/mm/bpf_thp.c
> > +++ b/mm/bpf_thp.c
> > @@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> >  #endif
> >  }
> >
> > +/**
> > + * bpf_mm_get_task - Get the task struct associated with a mm_struct.
> > + * @mm: The mm_struct to query
> > + *
> > + * The obtained task_struct must be released by calling bpf_task_release().
>
> Hmmm so now bpf programs can cause kernel bugs by keeping a reference around?
>
> This feels extremely dodgy, I don't like this at all.
>
> I thought the whole point of BPF was that this kind of thing couldn't possibly
> happen?
>
> Or would this be a kernel bug?
>
> If a bpf program can lead to a refcount not being put, this is not
> upstreamable surely?

As explained by Andrii, the BPF verifier guards against this.

>
> > + *
> > + * Return: The associated task_struct on success, or NULL on failure. Note that
> > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > + * NULL if CONFIG_MEMCG is not configured.
> > + */
> > +__bpf_kfunc struct task_struct *bpf_mm_get_task(struct mm_struct *mm)
> > +{
> > +#ifdef CONFIG_MEMCG
> > +     struct task_struct *task;
> > +
> > +     if (!mm)
> > +             return NULL;
> > +     rcu_read_lock();
> > +     task = rcu_dereference(mm->owner);
>
> > +     if (!task)
> > +             goto out;
> > +     if (!refcount_inc_not_zero(&task->rcu_users))
> > +             goto out;
> > +
> > +     rcu_read_unlock();
> > +     return task;
> > +
> > +out:
> > +     rcu_read_unlock();
> > +#endif
>
> This #ifdeffery is horrid, can we please just have separate functions instead of
> inside the one? Thanks.
>
> > +     return NULL;
>
> So we can't tell the difference between this failing due to CONFIG_MEMCG
> not being set (in which case it will _always_ fail) or we couldn't get a
> task or we couldn't get a refcount on the task.
>
> Maybe this doesn't matter since perhaps we are only using this if
> CONFIG_MEMCG but in that case why even expose this if !CONFIG_MEMCG?
>

As suggested by Andrii, I will remove this kfunc and mark mm->owner as
BTF_TYPE_SAFE_TRUSTED_OR_NULL.
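
A minimal sketch, assuming it follows the existing annotation pattern
in kernel/bpf/verifier.c:

/* Allow BPF programs to dereference mm->owner directly after a NULL
 * check (CONFIG_MEMCG only); the struct would also need to be emitted
 * in type_is_trusted_or_null().
 */
BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct mm_struct) {
        struct task_struct __rcu *owner;
};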

Thanks for your comments.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-27 21:50     ` Andrii Nakryiko
@ 2025-08-28  6:50       ` Yafang Shao
  2025-08-28 10:51       ` Lorenzo Stoakes
  1 sibling, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-28  6:50 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Lorenzo Stoakes, akpm, david, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, bpf, linux-mm, linux-doc

On Thu, Aug 28, 2025 at 5:50 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Aug 27, 2025 at 8:48 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Aug 26, 2025 at 03:19:41PM +0800, Yafang Shao wrote:
> > > We will utilize this new kfunc bpf_mm_get_task() to retrieve the
> > > associated task_struct from the given @mm. The obtained task_struct must
> > > be released by calling bpf_task_release() as a paired operation.
> >
> > You're basically describing the patch you're not saying why - yeah you're
> > getting a task struct from an mm (only if CONFIG_MEMCG which you don't
> > mention here), but not for what purpose you intend to use this?
> >
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > ---
> > >  mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > > index b757e8f425fd..46b3bc96359e 100644
> > > --- a/mm/bpf_thp.c
> > > +++ b/mm/bpf_thp.c
> > > @@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> > >  #endif
> > >  }
> > >
> > > +/**
> > > + * bpf_mm_get_task - Get the task struct associated with a mm_struct.
> > > + * @mm: The mm_struct to query
> > > + *
> > > + * The obtained task_struct must be released by calling bpf_task_release().
> >
> > Hmmm so now bpf programs can cause kernel bugs by keeping a reference around?
>
> BPF verifier will reject any program that cannot guarantee that
> bpf_task_release() will always be called. So there shouldn't be any
> problem here.

Thanks for the clarification.

>
> >
> > This feels extremely dodgy, I don't like this at all.
> >
> > I thought the whole point of BPF was that this kind of thing couldn't possibly
> > happen?
> >
> > Or would this be a kernel bug?
> >
> > If a bpf program can lead to a refcount not being put, this is not
> > upstreamable surely?
> >
> > > + *
> > > + * Return: The associated task_struct on success, or NULL on failure. Note that
> > > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > > + * NULL if CONFIG_MEMCG is not configured.
> > > + */
> > > +__bpf_kfunc struct task_struct *bpf_mm_get_task(struct mm_struct *mm)
> > > +{
> > > +#ifdef CONFIG_MEMCG
> > > +     struct task_struct *task;
> > > +
> > > +     if (!mm)
> > > +             return NULL;
> > > +     rcu_read_lock();
> > > +     task = rcu_dereference(mm->owner);
>
> Question to Yafang, though. Instead of adding new kfunc just for this,
> have you tried marking mm->owner as BTF_TYPE_SAFE_TRUSTED_OR_NULL,
> which, if I understand correctly, would allow BPF program to just work
> with `mm->owner` (after checking for NULL) directly. And then you can
> just use existing bpf_task_acquire()

good suggestion.
will change it.

>
> >
> > > +     if (!task)
> > > +             goto out;
> > > +     if (!refcount_inc_not_zero(&task->rcu_users))
> > > +             goto out;
>
> nit: just call bpf_task_acquire(), which will more obviously pair with
> suggested bpf_task_release()?

makes sense.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-27 15:34   ` Lorenzo Stoakes
  2025-08-27 20:50     ` Shakeel Butt
@ 2025-08-28  6:57     ` Yafang Shao
  2025-08-28 10:42       ` Lorenzo Stoakes
  1 sibling, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-28  6:57 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc, Michal Hocko, Roman Gushchin, Shakeel Butt

On Wed, Aug 27, 2025 at 11:34 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> +cc cgroup people, please do include them on this stuff.

sure.

>
> BTW I see there is a BPF [STORAGE & CGROUPS] section in MAINTAINERS and
> kernel/bpf/cgroup.c etc. anything useful there for us?

BPF local storage can assist in implementing this feature. However, we
still need to introduce a new helper, bpf_mm_get_mem_cgroup(), to
retrieve the mem_cgroup from an mm_struct.
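
For example, a sketch of how the hook body might use it (the
cgroup-ID comparison is illustrative; target_cgrp_id would be filled
in from userspace):

extern struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm) __ksym;
extern void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;

/* body of the get_suggested_order() hook */
struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);
int suggested = 0;

if (memcg) {
        /* only suggest THP for tasks in the target memcg */
        if (memcg->css.cgroup->kn->id == target_cgrp_id)
                suggested = orders;
        bpf_put_mem_cgroup(memcg);
}
return suggested;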

>
> On Tue, Aug 26, 2025 at 03:19:40PM +0800, Yafang Shao wrote:
> > We will utilize this new kfunc bpf_mm_get_mem_cgroup() to retrieve the
> > associated mem_cgroup from the given @mm. The obtained mem_cgroup must
> > be released by calling bpf_put_mem_cgroup() as a paired operation.
>
> What locking guarantees do we have that this is all fine?

As explained by Shakeel, no locking is needed here.

>
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  mm/bpf_thp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>
> Also not to be nitty (but I'm going to be anyway :P) but I'm not in love with
> the filename here.
>
> So now we have
>
> - khugepaged.c
> - huge_memory.c
> - bpf_thp.c
>
> Let's maybe call it huge_memory_bpf.c for consistency?

makes sense.

> And obv as mentioned
> before, add it to the MAINTAINERS in the THP section plz.

will do it.


-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-27 20:45   ` Shakeel Butt
@ 2025-08-28  6:58     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-28  6:58 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, bpf, linux-mm, linux-doc

On Thu, Aug 28, 2025 at 4:46 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Aug 26, 2025 at 03:19:40PM +0800, Yafang Shao wrote:
> > We will utilize this new kfunc bpf_mm_get_mem_cgroup() to retrieve the
> > associated mem_cgroup from the given @mm. The obtained mem_cgroup must
> > be released by calling bpf_put_mem_cgroup() as a paired operation.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  mm/bpf_thp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 50 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > index fbff3b1bb988..b757e8f425fd 100644
> > --- a/mm/bpf_thp.c
> > +++ b/mm/bpf_thp.c
> > @@ -175,10 +175,59 @@ static struct bpf_struct_ops bpf_bpf_thp_ops = {
> >       .name = "bpf_thp_ops",
> >  };
> >
> > +__bpf_kfunc_start_defs();
> > +
> > +/**
> > + * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
> > + * @mm: The mm_struct to query
> > + *
> > + * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
> > + *
> > + * Return: The associated mem_cgroup on success, or NULL on failure. Note that
> > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > + * NULL if CONFIG_MEMCG is not configured.
> > + */
> > +__bpf_kfunc struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm)
> > +{
> > +     return get_mem_cgroup_from_mm(mm);
> > +}
> > +
> > +/**
> > + * bpf_put_mem_cgroup - Release a memory cgroup obtained from bpf_mm_get_mem_cgroup()
> > + * @memcg: The memory cgroup to release
> > + */
> > +__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> > +{
> > +#ifdef CONFIG_MEMCG
> > +     if (!memcg)
> > +             return;
> > +     css_put(&memcg->css);
> > +#endif
>
> Just use mem_cgroup_put() here.

I will change it. Thanks for the clarification.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-27 20:50     ` Shakeel Butt
@ 2025-08-28 10:40       ` Lorenzo Stoakes
  2025-08-28 16:00         ` Shakeel Butt
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-28 10:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc, Michal Hocko, Roman Gushchin

On Wed, Aug 27, 2025 at 01:50:18PM -0700, Shakeel Butt wrote:
> On Wed, Aug 27, 2025 at 04:34:48PM +0100, Lorenzo Stoakes wrote:
> > > +__bpf_kfunc_start_defs();
> > > +
> > > +/**
> > > + * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
> > > + * @mm: The mm_struct to query
> > > + *
> > > + * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
> > > + *
> > > + * Return: The associated mem_cgroup on success, or NULL on failure. Note that
> > > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > > + * NULL if CONFIG_MEMCG is not configured.
> >
> > What kind of locking is assumed here?
> >
> > Are we protected against mmdrop() clearing out the mm?
>
> No locking is needed. Just the valid mm object or NULL. Usually the
> underlying function (get_mem_cgroup_from_mm) is called in page fault
> context where the current is holding mm. Here the only requirement is
> that mm is valid either through explicit reference or the context.

I mean this may be down to me being not so familiar with BPF, but my concern is
that we're handing _any_ mm here.

So presumably this could also be a remote mm?

If not then why are we accepting an mm parameter at all, when we could just grab
current->mm?

If it's a remote mm, then we need to be absolutely sure that we won't UAF.

> I also feel we should talk about this in the kdoc, unless BPF always
> asserts these things to be the case + verifies them somehow.

>
> >
> > > + */
> > > +__bpf_kfunc struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm)
> > > +{
> > > +	return get_mem_cgroup_from_mm(mm);
> > > +}
> > > +
> > > +/**
> > > + * bpf_put_mem_cgroup - Release a memory cgroup obtained from bpf_mm_get_mem_cgroup()
> > > + * @memcg: The memory cgroup to release
> > > + */
> > > +__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> > > +{
> > > +#ifdef CONFIG_MEMCG
> > > +	if (!memcg)
> > > +		return;
> > > +	css_put(&memcg->css);
> >
> > Feels weird to have an ifdef here but not elsewhere, maybe the whole thing
> > should be ifdef...?
> >
> > Is there not a put equivalent for get_mem_cgroup_from_mm()? That is a bit weird.
> >
> > Also do we now reference the memcg global? That's pretty gross, could we not
> > actually implement such a helper?
> >
> > Is it valid to do this also? Maybe cgroup people can chime in.
>
> There is mem_cgroup_put() which should handle !CONFIG_MEMCG configs.
>

OK Yafang - let's use this instead then?


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-28  6:57     ` Yafang Shao
@ 2025-08-28 10:42       ` Lorenzo Stoakes
  2025-08-29  3:09         ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-28 10:42 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc, Michal Hocko, Roman Gushchin, Shakeel Butt

On Thu, Aug 28, 2025 at 02:57:03PM +0800, Yafang Shao wrote:
> On Wed, Aug 27, 2025 at 11:34 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > +cc cgroup people, please do include them on this stuff.
>
> sure.

Be good to cc on future respins for the whole series also! :) just so everybody
is in the loop, thanks!

>
> >
> > BTW I see there is a BPF [STORAGE & CGROUPS] section in MAINTAINERS and
> > kernel/bpf/cgroup.c etc. anything useful there for us?
>
> BPF local storage can assist in implementing this feature. However, we
> still need to introduce a new helper, bpf_mm_get_mem_cgroup(), to
> retrieve the mem_cgroup from an mm_struct.
>
> >
> > On Tue, Aug 26, 2025 at 03:19:40PM +0800, Yafang Shao wrote:
> > > We will utilize this new kfunc bpf_mm_get_mem_cgroup() to retrieve the
> > > associated mem_cgroup from the given @mm. The obtained mem_cgroup must
> > > be released by calling bpf_put_mem_cgroup() as a paired operation.
> >
> > What locking guarantees do we have that this is all fine?
>
> As explained by Shakeel, no locking is needed here.

Thanks, I responded there.

>
> >
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > ---
> > >  mm/bpf_thp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
> >
> > Also not to be nitty (but I'm going to be anyway :P) but I'm not in love with
> > the filename here.
> >
> > So now we have
> >
> > - khugepaged.c
> > - huge_memory.c
> > - bpf_thp.c
> >
> > Let's maybe call it huge_memory_bpf.c for consistency?
>
> makes sense.
>
> > And obv as mentioned
> > before, add it to the MAINTAINERS in the THP section plz.
>
> will do it.

Thank you on both! :)

>
>
> --
> Regards
> Yafang

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-28  5:54     ` Yafang Shao
@ 2025-08-28 10:50       ` Lorenzo Stoakes
  2025-08-29  3:01         ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-28 10:50 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote:
> On Wed, Aug 27, 2025 at 11:03 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Aug 26, 2025 at 03:19:39PM +0800, Yafang Shao wrote:
> > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> > > programs to influence THP order selection based on factors such as:
> > > - Workload identity
> > >   For example, workloads running in specific containers or cgroups.
> > > - Allocation context
> > >   Whether the allocation occurs during a page fault, khugepaged, or other
> > >   paths.
> > > - System memory pressure
> > >   (May require new BPF helpers to accurately assess memory pressure.)
> > >
> > > Key Details:
> > > - Only one BPF program can be attached at a time, but it can be updated
> > >   dynamically to adjust the policy.
> > > - Supports automatic mTHP order selection and per-workload THP policies.
> > > - Only functional when THP is set to madvise or always.
> > >
> > > It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> > > This feature is unstable and may evolve in future kernel versions.
> > >
> > > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> > > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> > >
> > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > ---
> > >  include/linux/huge_mm.h    |  15 +++
> > >  include/linux/khugepaged.h |  12 ++-
> > >  mm/Kconfig                 |  12 +++
> > >  mm/Makefile                |   1 +
> > >  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
> >
> > Please add new files to MAINTAINERS as you add them.
>
> will do it.
>
> >
> > >  mm/huge_memory.c           |  10 ++
> > >  mm/khugepaged.c            |  26 +++++-
> > >  mm/memory.c                |  18 +++-
> > >  8 files changed, 273 insertions(+), 7 deletions(-)
> > >  create mode 100644 mm/bpf_thp.c
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 1ac0d06fb3c1..f0c91d7bd267 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -6,6 +6,8 @@
> > >
> > >  #include <linux/fs.h> /* only for vma_is_dax() */
> > >  #include <linux/kobject.h>
> > > +#include <linux/pgtable.h>
> > > +#include <linux/mm.h>
> >
> > Hm this is a bit weird as mm.h includes huge_mm... I guess it will be handled by
> > header defines but still.
>
> Some refactoring is needed for these two header files, but we can
> handle it separately later.
>
> >
> > >
> > >  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> > >  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
> > >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> > >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> > >  };
> > >
> > >  struct kobject;
> > > @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
> > >                       (1<<TRANSPARENT_HUGEPAGE_FLAG);
> > >  }
> > >
> > > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                     u64 vma_flags, enum tva_type tva_flags, int orders);
> >
> > Not a massive fan of this naming to be honest. I think it should explicitly
> > reference bpf, e.g. bpf_hook_thp_get_order() or something.
>
> will change it to bpf_hook_thp_get_orders().

Thanks!

>
> >
> > Right now this is super unclear as to what it's for.
> >
> > Also wrt vma_flags - this type is wrong :) it's vm_flags_t and going to change
> > to a bitmap of unlimiiteeed size soon. So probs best not to pass around as value
> > type either.
>
> As replied in another thread. I will change it.

Thanks. Will check the other thread.

>
> >
> > But unclear as to purpose, as mentioned elsewhere.
> >
> > And also get_suggested_order() should be get_suggested_orderS() no? As you
> > seem later in the code to be referencing a bitfield?
>
> Right, it should be bpf_hook_thp_get_orderS().

Thanks!

>
> >
> > Also will mm ever != vma->vm_mm?
>
> No, it can't; that's guaranteed by the caller.

In this case we don't need to pass mm separately then right?

>
> >
> > Are we hacking this for the sake of overloading what this does?
>
> The @vma is actually unneeded. I will remove it.

Ah OK.

I am still a little concerned about passing around a value reference to the VMA
flags though, esp as this type can + will change in future (not sure what that
means for BPF).

We may go to e.g. a 128 bit bitmap there etc.


>
> >
> > Also if we're returning a bitmask of orders which you seem to be (not sure I
> > like that tbh - I feel like we should simply provide one order but open for
> > discussion) - shouldn't it return an unsigned long?
>
> We are indifferent to whether a single order or a bitmask is returned,
> as we only use order-0 and order-9. We have no use cases for
> middle-order pages, though this feature might be useful for other
> architectures or for some special use cases.

Well surely we want to potentially specify a mTHP under certain circumstances
no?

In any case I feel it's worth making any bitfield a system word size.
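
As a quick sketch of the bitmask convention under discussion (highest_order() is the helper quoted below; the particular orders are just an example):

	unsigned long orders = BIT(PMD_ORDER) | BIT(4);	/* PMD THP plus one mTHP size */
	int order = highest_order(orders);		/* PMD_ORDER, via fls_long() - 1 */

	/* mm code walks from the largest to the smallest candidate order */
	orders &= ~BIT(order);
	order = highest_order(orders);			/* now 4 */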

>
> >
> > > +#else
> > > +static inline int
> > > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                 u64 vma_flags, enum tva_type tva_flags, int orders)
> > > +{
> > > +     return orders;
> > > +}
> > > +#endif
> > > +
> > >  static inline int highest_order(unsigned long orders)
> > >  {
> > >       return fls_long(orders) - 1;
> > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > index eb1946a70cff..d81c1228a21f 100644
> > > --- a/include/linux/khugepaged.h
> > > +++ b/include/linux/khugepaged.h
> > > @@ -4,6 +4,8 @@
> > >
> > >  #include <linux/mm.h>
> > >
> > > +#include <linux/huge_mm.h>
> > > +
> >
> > Hm, this is iffy too. There's probably a reason we didn't include this before;
> > the headers can be so, so fragile. Let's be cautious...
>
> I will check.

Thanks!

>
> >
> > >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >  extern struct attribute_group khugepaged_attr_group;
> > > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > >
> > >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > >  {
> > > -     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> > > +     /*
> > > +      * THP allocation policy can be dynamically modified via BPF. Even if a
> > > +      * task was allowed to allocate THPs, BPF can decide whether its forked
> > > +      * child can allocate THPs.
> > > +      *
> > > +      * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > > +      */
> > > +     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> > > +             get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> >
> > Hmmm so there seems to be some kind of additional functionality you're providing
> > here kinda quietly, which is to allow the exact same interface to determine
> > whether we kick off khugepaged or not.
> >
> > Don't love that, I think we should be hugely specific about that.
> >
> > This bpf interface should literally be 'ok we're deciding what order we
> > want'. It feels like a bit of a gross overloading?
>
> This makes sense. I have no objection to reverting to returning a single order.

OK but key point here is - we're now determining if a forked child can _not_
allocate THPs using this function.

To me this should be a separate function rather than some _weird_ usage of this
same function.

And generally at this point I think we should just drop this bit of code
honestly.

>
> >
> > >               __khugepaged_enter(mm);
> > >  }
> > >
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 4108bcd96784..d10089e3f181 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
> > >
> > >         EXPERIMENTAL because the impact of some changes is still unclear.
> > >
> > > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > > +
> > > +     help
> > > +       Enable dynamic THP order selection using BPF programs. This
> > > +       experimental feature allows custom BPF logic to determine optimal
> > > +       transparent hugepage allocation sizes at runtime.
> > > +
> > > +       Warning: This feature is unstable and may change in future kernel
> > > +       versions.
> >
> > Thanks! This is important to document. Absolute nitty nit: can you capitalise
> > 'WARNING'? Thanks!
>
> will do it.

Thanks!

>
> >
> > > +
> > >  endif # TRANSPARENT_HUGEPAGE
> > >
> > >  # simple helper to make the code a bit easier to read
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index ef54aa615d9d..cb55d1509be1 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> > >  obj-$(CONFIG_NUMA) += memory-tiers.o
> > >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> > >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> > >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > > new file mode 100644
> > > index 000000000000..fbff3b1bb988
> > > --- /dev/null
> > > +++ b/mm/bpf_thp.c
> >
> > As mentioned before, please update MAINTAINERS for new files. I went to great +
> > painful lengths to get everything listed there so let's keep it that way please
> > :P
>
> will do it.

Thanks!

>
> >
> > > @@ -0,0 +1,186 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +
> > > +#include <linux/bpf.h>
> > > +#include <linux/btf.h>
> > > +#include <linux/huge_mm.h>
> > > +#include <linux/khugepaged.h>
> > > +
> > > +struct bpf_thp_ops {
> > > +     /**
> > > +      * @get_suggested_order: Get the suggested THP orders for allocation
> > > +      * @mm: mm_struct associated with the THP allocation
> > > +      * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> > > +      *                 When NULL, the decision should be based on @mm (i.e., when
> > > +      *                 triggered from an mm-scope hook rather than a VMA-specific
> > > +      *                 context).
> > > +      *                 Must belong to @mm (guaranteed by the caller).
> > > +      * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> > > +      * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> > > +      * @orders: Bitmask of requested THP orders for this allocation
> > > +      *          - PMD-mapped allocation if PMD_ORDER is set
> > > +      *          - mTHP allocation otherwise
> > > +      *
> > > +      * Return: Bitmask of suggested THP orders for allocation. The highest
> > > +      *         suggested order will not exceed the highest requested order
> > > +      *         in @orders.
> > > +      */
> > > +     int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                                u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
> >
> > I feel like we should be declaring this function pointer type somewhere else as
> > we're now duplicating this in two places.
>
> Agreed, I have already done it to fix the sparse warning.

Thanks!

>
> >
> > > +};
> > > +
> > > +static struct bpf_thp_ops bpf_thp;
> > > +static DEFINE_SPINLOCK(thp_ops_lock);
> > > +
> > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                     u64 vma_flags, enum tva_type tva_flags, int orders)
> >
> > surely tva_flag? As this is an enum value?
>
> will change it to tva_type instead.

Thanks!

>
> >
> > > +{
> > > +     int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                                u64 vma_flags, enum tva_type tva_flags, int orders);
> >
> > This type for vma flags is totally incorrect. vm_flags_t. And that's going to
> > change soon to an opaque type.
> >
> > Also right now it's actually an unsigned long.
> >
> > I really really do not like that we're providing extra, unexplained VMA flags
> > for some reason. I may be missing something :) so happy to hear why this is
> > necessary.
> >
> > However in future we really shouldn't be passing something like this.
>
> will change it as replied in another thread.

Thanks!

>
> >
> > Also - now a third duplication of the same function pointer :) can we do better
> > than this? At least typedef it.
> >
> > > +     int suggested_orders = orders;
> > > +
> > > +     /* No BPF program is attached */
> > > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > > +                   &transparent_hugepage_flags))
> > > +             return suggested_orders;
> >
> > This is atomic ofc, but are we concerned about races, or I guess you expect only
> > the first attached bpf program to work with it I suppose.
>
> It guards against racing with unreg or update.

OK cool, it does make sense overall.

>
> >
> > > +
> > > +     rcu_read_lock();
> >
> > Is this sufficient? Anything stopping the mm or VMA going away here?
>
> This RCU lock is not for protecting the mm or VMA structures
> themselves, but for protecting the update of the function pointer.
> Arbitrary access to pointers within the mm_struct or vm_area_struct is
> prohibited, as they are guarded by the BPF verifier.
>
> >
> > > +     bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> > > +     if (!bpf_suggested_order)
> > > +             goto out;
> > > +
> > > +     suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
> >
> > OK so now it's suggested order_S but we're invoking suggested order :) whaaatt?
> > :)
>
> will change it.

Thanks!

>
> >
> > > +     if (highest_order(suggested_orders) > highest_order(orders))
> > > +             suggested_orders = orders;
> >
> > Hmmm so the semantics are - whichever is the highest order wins?
>
> The maximum requested order is determined by the callsite. For example:
> - PMD-mapped THP uses PMD_ORDER
> - mTHP uses (PMD_ORDER - 1)
>
> We must respect this upper bound to avoid undefined behavior. So the
> highest suggested order can't exceed the highest requested order.

OK, please document this in a comment here.
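
A sketch of what such a comment could spell out, with hypothetical values:

	/*
	 * The caller's @orders is an upper bound: e.g. the mTHP path passes
	 * BIT(PMD_ORDER) - 1, so a BPF suggestion of BIT(PMD_ORDER) must be
	 * discarded rather than acted on.
	 */
	int requested = BIT(PMD_ORDER) - 1;	/* caller allows mTHP orders only */
	int suggested = BIT(PMD_ORDER);		/* BPF program oversteps the bound */

	if (highest_order(suggested) > highest_order(requested))
		suggested = requested;		/* clamp back to the caller's limit */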

>
> >
> > I thought the idea was we'd hand control over to bpf if provided in effect?
> >
> > Definitely worth going over these semantics in the cover letter (and do forgive
> > me if you have and I've missed! :)
>
> It is already in the cover letter:
>
>  * Return: Bitmask of suggested THP orders for allocation. The highest
>  *         suggested order will not exceed the highest requested order
>  *         in @orders.

OK cool thanks, a comment here would be useful also.

>
>
> >
> > > +
> > > +out:
> > > +     rcu_read_unlock();
> > > +     return suggested_orders;
> > > +}
> > > +
> > > +static bool bpf_thp_ops_is_valid_access(int off, int size,
> > > +                                     enum bpf_access_type type,
> > > +                                     const struct bpf_prog *prog,
> > > +                                     struct bpf_insn_access_aux *info)
> > > +{
> > > +     return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> > > +}
> > > +
> > > +static const struct bpf_func_proto *
> > > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > +{
> > > +     return bpf_base_func_proto(func_id, prog);
> > > +}
> > > +
> > > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> > > +     .get_func_proto = bpf_thp_get_func_proto,
> > > +     .is_valid_access = bpf_thp_ops_is_valid_access,
> > > +};
> > > +
> > > +static int bpf_thp_init(struct btf *btf)
> > > +{
> > > +     return 0;
> > > +}
> > > +
> > > +static int bpf_thp_init_member(const struct btf_type *t,
> > > +                            const struct btf_member *member,
> > > +                            void *kdata, const void *udata)
> > > +{
> > > +     return 0;
> > > +}
> > > +
> > > +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> > > +{
> > > +     struct bpf_thp_ops *ops = kdata;
> > > +
> > > +     spin_lock(&thp_ops_lock);
> > > +     if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > > +                          &transparent_hugepage_flags)) {
> > > +             spin_unlock(&thp_ops_lock);
> > > +             return -EBUSY;
> > > +     }
> > > +     WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
> > > +     rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order);
> > > +     spin_unlock(&thp_ops_lock);
> > > +     return 0;
> > > +}
> > > +
> > > +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> > > +{
> > > +     spin_lock(&thp_ops_lock);
> > > +     clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> > > +     WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> > > +     rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> > > +     spin_unlock(&thp_ops_lock);
> > > +
> > > +     synchronize_rcu();
> > > +}
> >
> > I am a total beginner with BPF implementations so don't feel like I can say much
> > intelligent about the above. But presumably fairly standard fare BPF-wise?
>
> This implementation is necessary to support BPF program updates.

Ack.
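
For context, a user-space sketch of the update flow this supports, assuming libbpf and hypothetical skeleton names (skel_a/skel_b):

	/* attach policy A, then atomically swap in policy B with no detach window */
	struct bpf_link *link = bpf_map__attach_struct_ops(skel_a->maps.thp_ops);
	int err;

	if (!link)
		return -errno;
	err = bpf_link__update_map(link, skel_b->maps.thp_ops);	/* lands in bpf_thp_update() */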

>
> >
> > Will perhaps try to dig deeper on another iteration :) as interesting to me.
> >
> > > +
> > > +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> > > +{
> > > +     struct bpf_thp_ops *ops = kdata;
> > > +     struct bpf_thp_ops *old = old_kdata;
> > > +     int ret = 0;
> > > +
> > > +     if (!ops || !old)
> > > +             return -EINVAL;
> > > +
> > > +     spin_lock(&thp_ops_lock);
> > > +     /* The prog has already been removed. */
> > > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
> > > +             ret = -ENOENT;
> > > +             goto out;
> > > +     }
> >
> > OK so we gate things on this flag and it's global, got it.
> >
> > I see this is a hook, and I guess RCU-all-the-things is what BPF does which
> > makes tonnes of sense.
> >
> > > +     WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> > > +     rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
> > > +                         lockdep_is_held(&thp_ops_lock));
> > > +
> > > +out:
> > > +     spin_unlock(&thp_ops_lock);
> > > +     if (!ret)
> > > +             synchronize_rcu();
> > > +     return ret;
> > > +}
> > > +
> > > +static int bpf_thp_validate(void *kdata)
> > > +{
> > > +     struct bpf_thp_ops *ops = kdata;
> > > +
> > > +     if (!ops->get_suggested_order) {
> > > +             pr_err("bpf_thp: required ops isn't implemented\n");
> > > +             return -EINVAL;
> > > +     }
> > > +     return 0;
> > > +}
> > > +
> > > +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > +                        u64 vma_flags, enum tva_type tva_flags, int orders)
> > > +{
> > > +     return orders;
> > > +}
> > > +
> > > +static struct bpf_thp_ops __bpf_thp_ops = {
> > > +     .get_suggested_order = suggested_order,
> > > +};
> >
> > Can you explain to me what this stub stuff is for? This is more 'BPF impl 101'
> > stuff sorry :)
>
> It is a CFI stub. cfi_stubs in BPF struct_ops are secure intermediary
> functions that prevent the kernel from making direct, unsafe jumps to
> BPF code. A newly attached BPF program will run via this stub.

Ack.

>
> >
> > > +
> > > +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> > > +     .verifier_ops = &thp_bpf_verifier_ops,
> > > +     .init = bpf_thp_init,
> > > +     .init_member = bpf_thp_init_member,
> > > +     .reg = bpf_thp_reg,
> > > +     .unreg = bpf_thp_unreg,
> > > +     .update = bpf_thp_update,
> > > +     .validate = bpf_thp_validate,
> > > +     .cfi_stubs = &__bpf_thp_ops,
> > > +     .owner = THIS_MODULE,
> > > +     .name = "bpf_thp_ops",
> > > +};
> > > +
> > > +static int __init bpf_thp_ops_init(void)
> > > +{
> > > +     int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> > > +
> > > +     if (err)
> > > +             pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> > > +     return err;
> > > +}
> > > +late_initcall(bpf_thp_ops_init);
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index d89992b65acc..bd8f8f34ab3c 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > >               return ret;
> > >       khugepaged_enter_vma(vma, vma->vm_flags);
> > >
> > > +     /*
> > > +      * This check must occur after khugepaged_enter_vma() because:
> > > +      * 1. We may permit THP allocation via khugepaged
> > > +      * 2. While simultaneously disallowing THP allocation
> > > +      *    during page fault handling
> > > +      */
> > > +     if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
> > > +                             BIT(PMD_ORDER))
> >
> > Hmmm so you return a bitmask of orders, but then you only allow this fault if
> > the only order provided is PMD order? That seems strange. Can you explain?
>
> This is in do_huge_pmd_anonymous_page(), which can only accept a PMD
> order; anything else might result in unexpected behavior.

OK please document this in the comment.

>
> >
> > > +             return VM_FAULT_FALLBACK;
> >
> > It'd be good to have a helper function for this like:
> >
> >         if (!bpf_hook_allow_pmd_order(vma, tva_flag))
> >                 return VM_FAULT_FALLBACK;
> >
> > And implemented like maybe:
> >
> > static bool bpf_hook_allow_pmd_order(struct vm_area_struct *vma, enum tva_type tva_flag)
> > {
> >         int orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags, tva_flag,
> >                         BIT(PMD_ORDER));
> >
> >         return orders & BIT(PMD_ORDER);
> > }
> >
> > It's good the tva flag gives context though.
>
> Thanks for the suggestion.
> will change it.


Thanks!

>
> >
> > > +
> > >       if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> > >                       !mm_forbids_zeropage(vma->vm_mm) &&
> > >                       transparent_hugepage_use_zero_page()) {
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>
> > > index d3d4f116e14b..935583626db6 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
> > >  {
> > >       if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> > >           hugepage_pmd_enabled()) {
> > > -             if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> > > +             if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) &&
> > > +                 get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED,
> > > +                                     BIT(PMD_ORDER)))
> >
> > I don't know why we aren't working the bpf hook into thp_vma_allowable_order()?
>
> Actually it can be added into thp_vma_allowable_order().  I will change it.

Thanks!
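
A sketch of what folding it in might look like (built on the thp_vma_allowable_orders() entry point; bpf_hook_thp_get_orders() is the name proposed earlier in this thread):

static inline unsigned long
thp_vma_allowable_orders(struct vm_area_struct *vma, vm_flags_t vm_flags,
			 enum tva_type type, unsigned long orders)
{
	/* let the BPF policy veto or narrow the candidate orders first */
	orders &= bpf_hook_thp_get_orders(vma->vm_mm, vma, vm_flags, type, orders);
	if (!orders)
		return 0;
	return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
}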

>
> >
> > Also a helper would work here.
> >
> > >                       __khugepaged_enter(vma->vm_mm);
> > >       }
> > >  }
> > > @@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > >               return SCAN_ADDRESS_RANGE;
> > >       if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
> > >               return SCAN_VMA_CHECK;
> > > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER)))
> > > +             return SCAN_VMA_CHECK;
> >
> >
> >
> > >       /*
> > >        * Anon VMA expected, the address may be unmapped then
> > >        * remapped to file after khugepaged reacquired the mmap_lock.
> > > @@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> > >               /* khugepaged_mm_lock actually not necessary for the below */
> > >               mm_slot_free(mm_slot_cache, mm_slot);
> > >               mmdrop(mm);
> > > +     } else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) {
> > > +             hash_del(&slot->hash);
> > > +             list_del(&slot->mm_node);
> > > +             mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> > > +             mm_slot_free(mm_slot_cache, mm_slot);
> > >       }
> > >  }
> > >
> > > @@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > >       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> > >               return SCAN_VMA_CHECK;
> > >
> > > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> > > +                              BIT(PMD_ORDER)))
> >
> > Again, can we please not duplicate thp_vma_allowable_order() logic?
> >
> > The THP code is horrible enough, but now we have to remember to also do the bpf
> > check?
>
> makes sense.
>
> >
> > > +             return SCAN_VMA_CHECK;
> > >       /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
> > >       if (userfaultfd_wp(vma))
> > >               return SCAN_PTE_UFFD_WP;
> > > @@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > >        * the next mm on the list.
> > >        */
> > >       vma = NULL;
> > > +
> > > +     /* If this mm is not suitable for the scan list, we should remove it. */
> > > +     if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> > > +             goto breakouterloop_mmap_lock;
> >
> > OK again I'm really not loving this NULL, 0, -1 stuff. What is this supposed to
> > mean? The idea here is we have a hook for 'trying to determine THP order' and
> > now it's overloaded it seems in multiple ways?
> >
> > I may be missing context here.
> >
> > I'm also a bit perplexed by the comment as to what is intended here.
>
> Using a BPF-based approach for THP adjustment allows us to dynamically
> enable or disable THP for running applications without causing any
> disruption. This capability is particularly valuable in production
> environments. The logic here is designed to achieve exactly that.
>
>
> >
> > >       if (unlikely(!mmap_read_trylock(mm)))
> > >               goto breakouterloop_mmap_lock;
> > >
> > > @@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> > >                       progress++;
> > >                       break;
> > >               }
> > > -             if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> > > +             if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) ||
> > > +                 !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED,
> > > +                                      BIT(PMD_ORDER))) {
> >
> > Same various comments from above.
>
> will change it.
>
> >
> > >  skip:
> > >                       progress++;
> > >                       continue;
> > > @@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > >       if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
> > >               return -EINVAL;
> > >
> > > +     if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> > > +                              BIT(PMD_ORDER)))
> > > +             return -EINVAL;
> > > +
> >
> > Same various comments from above.
>
> will change it.
>
> >
> > >       cc = kmalloc(sizeof(*cc), GFP_KERNEL);
> > >       if (!cc)
> > >               return -ENOMEM;
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index d9de6c056179..0178857aa058 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
> > >  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >  {
> > >       struct vm_area_struct *vma = vmf->vma;
> > > +     int order, suggested_orders;
> > >       unsigned long orders;
> > >       struct folio *folio;
> > >       unsigned long addr;
> > > @@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >       spinlock_t *ptl;
> > >       pte_t *pte;
> > >       gfp_t gfp;
> > > -     int order;
> > >
> > >       /*
> > >        * If uffd is active for the vma we need per-page fault fidelity to
> > > @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > >       if (!zswap_never_enabled())
> > >               goto fallback;
> > >
> > > +     suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> > > +                                            TVA_PAGEFAULT,
> > > +                                            BIT(PMD_ORDER) - 1);
> > > +     if (!suggested_orders)
> > > +             goto fallback;
> >

(Thanks for all above! :)

> > Wait, but below we have a bunch of fallbacks, now BPF overrides everything?
>
> When allocating high-order pages is not feasible, such as during
> periods of high memory pressure, the system should immediately fall
> back to using 4 KB pages.

OK makes sense.
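
As an illustration, a sketch of that policy in the struct_ops program (under_memory_pressure() stands in for a helper that doesn't exist yet, as the cover letter notes one would be needed):

	/* returning 0 orders forces an immediate fallback to 4KB pages */
	if (under_memory_pressure(mm))
		return 0;
	return orders;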

>
> >
> > I know I'm repeating myself :P but can we just please put this into
> > thp_vma_allowable_orders(), it's massively gross to just duplicate this check
> > _everywhere_ with subtle differences.
>
> will change it.

Thanks

>
> >
> > >       entry = pte_to_swp_entry(vmf->orig_pte);
> > >       /*
> > >        * Get a list of all the (large) orders below PMD_ORDER that are enabled
> > >        * and suitable for swapping THP.
> > >        */
> > >       orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> > > -                                       BIT(PMD_ORDER) - 1);
> > > +                                       suggested_orders);
> > >       orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> > >       orders = thp_swap_suitable_orders(swp_offset(entry),
> > >                                         vmf->address, orders);
> > > @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > >  {
> > >       struct vm_area_struct *vma = vmf->vma;
> > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +     int order, suggested_orders;
> > >       unsigned long orders;
> > >       struct folio *folio;
> > >       unsigned long addr;
> > >       pte_t *pte;
> > >       gfp_t gfp;
> > > -     int order;
> > >
> > >       /*
> > >        * If uffd is active for the vma we need per-page fault fidelity to
> > > @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > >       if (unlikely(userfaultfd_armed(vma)))
> > >               goto fallback;
> > >
> > > +     suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> > > +                                            TVA_PAGEFAULT,
> > > +                                            BIT(PMD_ORDER) - 1);
> > > +     if (!suggested_orders)
> > > +             goto fallback;
> >
> > Same comment as above.
>
> will change it.

Thanks!

>
>
> Thanks a lot for your comments.

No problem, thanks for the series!

I am generally excited about exploring this, so once we figure out details be
good to see where this can go!

>
>
> --
> Regards
>
> Yafang


Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-27 21:50     ` Andrii Nakryiko
  2025-08-28  6:50       ` Yafang Shao
@ 2025-08-28 10:51       ` Lorenzo Stoakes
  2025-08-29  3:15         ` Yafang Shao
  1 sibling, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-28 10:51 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc

On Wed, Aug 27, 2025 at 02:50:36PM -0700, Andrii Nakryiko wrote:
> On Wed, Aug 27, 2025 at 8:48 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Aug 26, 2025 at 03:19:41PM +0800, Yafang Shao wrote:
> > > We will utilize this new kfunc bpf_mm_get_task() to retrieve the
> > > associated task_struct from the given @mm. The obtained task_struct must
> > > be released by calling bpf_task_release() as a paired operation.
> >
> > You're basically describing the patch but not saying why - yeah, you're
> > getting a task struct from an mm (only if CONFIG_MEMCG, which you don't
> > mention here), but not for what purpose you intend to use it?
> >
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > ---
> > >  mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > > index b757e8f425fd..46b3bc96359e 100644
> > > --- a/mm/bpf_thp.c
> > > +++ b/mm/bpf_thp.c
> > > @@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> > >  #endif
> > >  }
> > >
> > > +/**
> > > + * bpf_mm_get_task - Get the task struct associated with a mm_struct.
> > > + * @mm: The mm_struct to query
> > > + *
> > > + * The obtained task_struct must be released by calling bpf_task_release().
> >
> > Hmmm so now bpf programs can cause kernel bugs by keeping a reference around?
>
> BPF verifier will reject any program that cannot guarantee that
> bpf_task_release() will always be called. So there shouldn't be any
> problem here.

Ah that's nice!

What specifically here is enforcing that? Apologies again - BPF is new to me.
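
(For anyone following along: presumably the acquiring kfunc is registered with KF_ACQUIRE and the releasing one with KF_RELEASE, so the verifier tracks the returned pointer as a reference that must be released on every program path. A sketch of what gets rejected, using the kfunc names from this series:)

	struct task_struct *t = bpf_mm_get_task(mm);

	if (!t)
		return 0;
	if (some_cond)
		return 0;	/* verifier rejects: the reference to 't' leaks here */
	bpf_task_release(t);	/* every path holding 't' must release it */
	return 0;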


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted
  2025-08-28  6:12     ` Yafang Shao
@ 2025-08-28 11:11       ` Lorenzo Stoakes
  2025-08-29  3:05         ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-28 11:11 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Thu, Aug 28, 2025 at 02:12:12PM +0800, Yafang Shao wrote:
> On Wed, Aug 27, 2025 at 11:46 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Aug 26, 2025 at 03:19:42PM +0800, Yafang Shao wrote:
> > > Every VMA must have an associated mm_struct, and it is safe to access
> >
> > Err this isn't true? Pretty sure special VMAs don't have that set.
>
> I’m not aware of any VMA that doesn’t belong to an mm_struct. If there
> is such a case, it would be helpful if you could point it out. In any
> case, I’ll remove the VMA-related code in the next version since it’s
> unnecessary.

If you look at get_vma_name() in fs/proc/task_mmu.c you'll see:

	if (!vma->vm_mm) {
		*name = "[vdso]";
		return;
	}

So a VDSO will have this condition.

I did a quick drgn()/printk() test and didn't see any, but maybe that's just
my system - in any case this appears to be a valid situation that can arise,
presumably because it's a VMA somehow shared with multiple mm's or something
truly god-awful like that :)

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-28 10:40       ` Lorenzo Stoakes
@ 2025-08-28 16:00         ` Shakeel Butt
  2025-08-29 10:45           ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Shakeel Butt @ 2025-08-28 16:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc, Michal Hocko, Roman Gushchin

On Thu, Aug 28, 2025 at 11:40:16AM +0100, Lorenzo Stoakes wrote:
> On Wed, Aug 27, 2025 at 01:50:18PM -0700, Shakeel Butt wrote:
> > On Wed, Aug 27, 2025 at 04:34:48PM +0100, Lorenzo Stoakes wrote:
> > > > +__bpf_kfunc_start_defs();
> > > > +
> > > > +/**
> > > > + * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
> > > > + * @mm: The mm_struct to query
> > > > + *
> > > > + * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
> > > > + *
> > > > + * Return: The associated mem_cgroup on success, or NULL on failure. Note that
> > > > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > > > + * NULL if CONFIG_MEMCG is not configured.
> > >
> > > What kind of locking is assumed here?
> > >
> > > Are we protected against mmdrop() clearing out the mm?
> >
> > No locking is needed. Just the valid mm object or NULL. Usually the
> > underlying function (get_mem_cgroup_from_mm) is called in page fault
> > context where the current is holding mm. Here the only requirement is
> > that mm is valid either through explicit reference or the context.
> 
> I mean this may be down to me being not so familiar with BPF, but my concern is
> that we're handing _any_ mm here.

It's not really any mm but rather the mm whose validity is ensured by
the caller. I don't know the BPF internals but if I understand Andrii's
response on the other email, the BPF verifier will make sure the BPF program
is holding a valid mm on which it is calling this function. In non-BPF
world, get_mem_cgroup_from_mm() assumes the caller is providing a valid
mm.

> 
> So presumably this could also be a remote mm?

Which is fine, as we already do this today, i.e. a page fault on accessing
the memory of a remote process.

> 
> If not then why are we accepting an mm parameter at all, when we could just grab
> current->mm?

Because current->mm might not be equal to the faulting mm, as in the case
of a remote page fault.
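
For reference, the underlying helper pins the memcg itself, which is why a valid mm is the only requirement (simplified from get_mem_cgroup_from_mm() in mm/memcontrol.c):

	struct mem_cgroup *memcg;

	rcu_read_lock();
	do {
		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
		if (unlikely(!memcg))
			memcg = root_mem_cgroup;
	} while (!css_tryget(&memcg->css));	/* retry if this css is going away */
	rcu_read_unlock();
	return memcg;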

> 
> If it's a remote mm, then we need to be absolutely sure that we won't UAF.
> 
> I also feel we should talk about this in the kdoc, unless BPF always somehow
> asserts these things to be the case + verifies them somehow.
> 

Yeah, some text on how the BPF verifier makes sure that the BPF program
is handling a valid mm would be good.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-28 10:50       ` Lorenzo Stoakes
@ 2025-08-29  3:01         ` Yafang Shao
  2025-08-29 10:42           ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-29  3:01 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Thu, Aug 28, 2025 at 6:50 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote:
> > On Wed, Aug 27, 2025 at 11:03 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Tue, Aug 26, 2025 at 03:19:39PM +0800, Yafang Shao wrote:
> > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > > > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> > > > programs to influence THP order selection based on factors such as:
> > > > - Workload identity
> > > >   For example, workloads running in specific containers or cgroups.
> > > > - Allocation context
> > > >   Whether the allocation occurs during a page fault, khugepaged, or other
> > > >   paths.
> > > > - System memory pressure
> > > >   (May require new BPF helpers to accurately assess memory pressure.)
> > > >
> > > > Key Details:
> > > > - Only one BPF program can be attached at a time, but it can be updated
> > > >   dynamically to adjust the policy.
> > > > - Supports automatic mTHP order selection and per-workload THP policies.
> > > > - Only functional when THP is set to madvise or always.
> > > >
> > > > It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> > > > This feature is unstable and may evolve in future kernel versions.
> > > >
> > > > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> > > > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> > > >
> > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > ---
> > > >  include/linux/huge_mm.h    |  15 +++
> > > >  include/linux/khugepaged.h |  12 ++-
> > > >  mm/Kconfig                 |  12 +++
> > > >  mm/Makefile                |   1 +
> > > >  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
> > >
> > > Please add new files to MAINTAINERS as you add them.
> >
> > will do it.
> >
> > >
> > > >  mm/huge_memory.c           |  10 ++
> > > >  mm/khugepaged.c            |  26 +++++-
> > > >  mm/memory.c                |  18 +++-
> > > >  8 files changed, 273 insertions(+), 7 deletions(-)
> > > >  create mode 100644 mm/bpf_thp.c
> > > >
> > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > index 1ac0d06fb3c1..f0c91d7bd267 100644
> > > > --- a/include/linux/huge_mm.h
> > > > +++ b/include/linux/huge_mm.h
> > > > @@ -6,6 +6,8 @@
> > > >
> > > >  #include <linux/fs.h> /* only for vma_is_dax() */
> > > >  #include <linux/kobject.h>
> > > > +#include <linux/pgtable.h>
> > > > +#include <linux/mm.h>
> > >
> > > Hm this is a bit weird as mm.h includes huge_mm... I guess it will be handled by
> > > header defines but still.
> >
> > Some refactoring is needed for these two header files, but we can
> > handle it separately later.
> >
> > >
> > > >
> > > >  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> > > >  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > > > @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
> > > >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> > > >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > > >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > > > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> > > >  };
> > > >
> > > >  struct kobject;
> > > > @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
> > > >                       (1<<TRANSPARENT_HUGEPAGE_FLAG);
> > > >  }
> > > >
> > > > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> > > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > +                     u64 vma_flags, enum tva_type tva_flags, int orders);
> > >
> > > Not a massive fan of this naming to be honest. I think it should explicitly
> > > reference bpf, e.g. bpf_hook_thp_get_order() or something.
> >
> > will change it to bpf_hook_thp_get_orders().
>
> Thanks!
>
> >
> > >
> > > Right now this is super unclear as to what it's for.
> > >
> > > Also wrt vma_flags - this type is wrong :) it's vm_flags_t and going to change
> > > to a bitmap of unlimiiteeed size soon. So probs best not to pass around as value
> > > type either.
> >
> > As replied in another thread. I will change it.
>
> Thanks. Will check the other thread.
>
> >
> > >
> > > But unclear as to purpose, as mentioned elsewhere.
> > >
> > > And also get_suggested_order() should be get_suggested_orderS() no? As you
> > > seem later in the code to be referencing a bitfield?
> >
> > Right, it should be bpf_hook_thp_get_orderS().
>
> Thanks!
>
> >
> > >
> > > Also will mm ever != vma->vm_mm?
> >
> > No, it can't; that's guaranteed by the caller.
>
> In this case we don't need to pass mm separately then right?

Right, we need to pass either @mm or @vma. However, there are cases
where vma information is not available at certain call sites, such as
in khugepaged. In those cases, we need to pass @mm instead.

>
> >
> > >
> > > Are we hacking this for the sake of overloading what this does?
> >
> > The @vma is actually unneeded. I will remove it.
>
> Ah OK.
>
> I am still a little concerned about passing around a value reference to the VMA
> flags though, esp as this type can + will change in future (not sure what that
> means for BPF).
>
> We may go to e.g. a 128 bit bitmap there etc.

As mentioned in another thread, we only need to determine whether the
flag is VM_HUGEPAGE or VM_NOHUGEPAGE, so it can be simplified.
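
A sketch of that simplification, with a hypothetical enum standing in for the raw flags value:

	/* hypothetical replacement for passing vm_flags by value to BPF */
	enum bpf_thp_vma_hint {
		VMA_HINT_NONE,		/* neither madvise hint set */
		VMA_HINT_HUGEPAGE,	/* VM_HUGEPAGE (MADV_HUGEPAGE) */
		VMA_HINT_NOHUGEPAGE,	/* VM_NOHUGEPAGE (MADV_NOHUGEPAGE) */
	};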

>
>
> >
> > >
> > > Also if we're returning a bitmask of orders which you seem to be (not sure I
> > > like that tbh - I feel like we should simply provide one order but open for
> > > discussion) - shouldn't it return an unsigned long?
> >
> > We are indifferent to whether a single order or a bitmask is returned,
> > as we only use order-0 and order-9. We have no use cases for
> > middle-order pages, though this feature might be useful for other
> > architectures or for some special use cases.
>
> Well surely we want to potentially specify a mTHP under certain circumstances
> no?

Perhaps there are use cases, but I haven't found any in our
production environment. On the other hand, I can clearly
see a risk that it could lead to more costly high-order allocations.

>
> In any case I feel it's worth making any bitfield a system word size.
>
> >
> > >
> > > > +#else
> > > > +static inline int
> > > > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > +                 u64 vma_flags, enum tva_type tva_flags, int orders)
> > > > +{
> > > > +     return orders;
> > > > +}
> > > > +#endif
> > > > +
> > > >  static inline int highest_order(unsigned long orders)
> > > >  {
> > > >       return fls_long(orders) - 1;
> > > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > > index eb1946a70cff..d81c1228a21f 100644
> > > > --- a/include/linux/khugepaged.h
> > > > +++ b/include/linux/khugepaged.h
> > > > @@ -4,6 +4,8 @@
> > > >
> > > >  #include <linux/mm.h>
> > > >
> > > > +#include <linux/huge_mm.h>
> > > > +
> > >
> > > Hm, this is iffy too. There's probably a reason we didn't include this before;
> > > the headers can be so, so fragile. Let's be cautious...
> >
> > I will check.
>
> Thanks!
>
> >
> > >
> > > >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > >  extern struct attribute_group khugepaged_attr_group;
> > > > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > >
> > > >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > > >  {
> > > > -     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> > > > +     /*
> > > > +      * THP allocation policy can be dynamically modified via BPF. Even if a
> > > > +      * task was allowed to allocate THPs, BPF can decide whether its forked
> > > > +      * child can allocate THPs.
> > > > +      *
> > > > +      * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > > > +      */
> > > > +     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> > > > +             get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> > >
> > > Hmmm so there seems to be some kind of additional functionality you're providing
> > > here kinda quietly, which is to allow the exact same interface to determine
> > > whether we kick off khugepaged or not.
> > >
> > > Don't love that, I think we should be hugely specific about that.
> > >
> > > This bpf interface should literally be 'ok we're deciding what order we
> > > want'. It feels like a bit of a gross overloading?
> >
> > This makes sense. I have no objection to reverting to returning a single order.
>
> OK but key point here is - we're now determining if a forked child can _not_
> allocate THPs using this function.
>
> To me this should be a separate function rather than some _weird_ usage of this
> same function.

Perhaps a separate function is better.

>
> And generally at this point I think we should just drop this bit of code
> honestly.

MMF_VM_HUGEPAGE is set when the THP mode is "always" or "madvise". If
it’s set, any forked child processes will inherit this flag. It is
only cleared when the mm_struct is destroyed (please correct me if I’m
wrong).

However, when you switch the THP mode to "never", tasks that still
have MMF_VM_HUGEPAGE remain on the khugepaged scan list. This isn’t an
issue under the current global mode because khugepaged doesn’t run
when THP is set to "never".

The problem arises when we move from a global mode to a per-task mode.
In that case, khugepaged may end up doing unnecessary work. For
example, if the THP mode is "always", but some tasks are not allowed
to allocate THP while still having MMF_VM_HUGEPAGE set, khugepaged
will continue scanning them unnecessarily.

To avoid this, we should prevent setting this flag for child processes
if they are not allowed to allocate THP in the first place. This way,
khugepaged won’t waste cycles scanning them. While an alternative
approach would be to set the flag at fork and later clear it for
khugepaged, it’s clearly more efficient to avoid setting it from the
start.
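
A sketch of the separate hook suggested above (bpf_hook_thp_fork_allowed() is a hypothetical name):

static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
	/* don't enrol the child with khugepaged if BPF policy forbids it THP */
	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
	    bpf_hook_thp_fork_allowed(mm))
		__khugepaged_enter(mm);
}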

>
> >
> > >
> > > >               __khugepaged_enter(mm);
> > > >  }
> > > >
> > > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > > index 4108bcd96784..d10089e3f181 100644
> > > > --- a/mm/Kconfig
> > > > +++ b/mm/Kconfig
> > > > @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
> > > >
> > > >         EXPERIMENTAL because the impact of some changes is still unclear.
> > > >
> > > > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > > > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > > > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > > > +
> > > > +     help
> > > > +       Enable dynamic THP order selection using BPF programs. This
> > > > +       experimental feature allows custom BPF logic to determine optimal
> > > > +       transparent hugepage allocation sizes at runtime.
> > > > +
> > > > +       Warning: This feature is unstable and may change in future kernel
> > > > +       versions.
> > >
> > > Thanks! This is important to document. Absolute nitty nit: can you capitalise
> > > 'WARNING'? Thanks!
> >
> > will do it.
>
> Thanks!
>
> >
> > >
> > > > +
> > > >  endif # TRANSPARENT_HUGEPAGE
> > > >
> > > >  # simple helper to make the code a bit easier to read
> > > > diff --git a/mm/Makefile b/mm/Makefile
> > > > index ef54aa615d9d..cb55d1509be1 100644
> > > > --- a/mm/Makefile
> > > > +++ b/mm/Makefile
> > > > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> > > >  obj-$(CONFIG_NUMA) += memory-tiers.o
> > > >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > > >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > > > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> > > >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > > >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> > > >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > > > new file mode 100644
> > > > index 000000000000..fbff3b1bb988
> > > > --- /dev/null
> > > > +++ b/mm/bpf_thp.c
> > >
> > > As mentioned before, please update MAINTAINERS for new files. I went to great +
> > > painful lengths to get everything listed there so let's keep it that way please
> > > :P
> >
> > will do it.
>
> Thanks!
>
> >
> > >
> > > > @@ -0,0 +1,186 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +
> > > > +#include <linux/bpf.h>
> > > > +#include <linux/btf.h>
> > > > +#include <linux/huge_mm.h>
> > > > +#include <linux/khugepaged.h>
> > > > +
> > > > +struct bpf_thp_ops {
> > > > +     /**
> > > > +      * @get_suggested_order: Get the suggested THP orders for allocation
> > > > +      * @mm: mm_struct associated with the THP allocation
> > > > +      * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> > > > +      *                 When NULL, the decision should be based on @mm (i.e., when
> > > > +      *                 triggered from an mm-scope hook rather than a VMA-specific
> > > > +      *                 context).
> > > > +      *                 Must belong to @mm (guaranteed by the caller).
> > > > +      * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> > > > +      * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> > > > +      * @orders: Bitmask of requested THP orders for this allocation
> > > > +      *          - PMD-mapped allocation if PMD_ORDER is set
> > > > +      *          - mTHP allocation otherwise
> > > > +      *
> > > > +      * Return: Bitmask of suggested THP orders for allocation. The highest
> > > > +      *         suggested order will not exceed the highest requested order
> > > > +      *         in @orders.
> > > > +      */
> > > > +     int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > +                                u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
> > >
> > > I feel like we should be declaring this function pointer type somewhere else as
> > > we're now duplicating this in two places.
> >
> > Agreed, I have already done it to fix the sparse warning.
>
> Thanks!
>
> >
> > >
> > > > +};
> > > > +
> > > > +static struct bpf_thp_ops bpf_thp;
> > > > +static DEFINE_SPINLOCK(thp_ops_lock);
> > > > +
> > > > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > +                     u64 vma_flags, enum tva_type tva_flags, int orders)
> > >
> > > surely tva_flag? As this is an enum value?
> >
> > will change it to tva_type instead.
>
> Thanks!
>
> >
> > >
> > > > +{
> > > > +     int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > +                                u64 vma_flags, enum tva_type tva_flags, int orders);
> > >
> > > This type for vma flags is totally incorrect. vm_flags_t. And that's going to
> > > change soon to an opaque type.
> > >
> > > Also right now it's actually an unsigned long.
> > >
> > > I really really do not like that we're providing extra, unexplained VMA flags
> > > for some reason. I may be missing something :) so happy to hear why this is
> > > necessary.
> > >
> > > However in future we really shouldn't be passing something like this.
> >
> > will change it as replied in another thread.
>
> Thanks!
>
> >
> > >
> > > Also - now a third duplication of the same function pointer :) can we do better
> > > than this? At least typedef it.
> > >
> > > > +     int suggested_orders = orders;
> > > > +
> > > > +     /* No BPF program is attached */
> > > > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > > > +                   &transparent_hugepage_flags))
> > > > +             return suggested_orders;
> > >
> > > This is atomic ofc, but are we concerned about races, or I guess you expect only
> > > the first attached bpf program to work with it I suppose.
> >
> > It guards against racing with unreg or update.
>
> OK cool, it does make sense overall.
>
> >
> > >
> > > > +
> > > > +     rcu_read_lock();
> > >
> > > Is this sufficient? Anything stopping the mm or VMA going away here?
> >
> > This RCU lock is not for protecting the mm or VMA structures
> > themselves, but for protecting the update of the function pointer.
> > Arbitrary access to pointers within the mm_struct or vm_area_struct is
> > prohibited, as they are guarded by the BPF verifier.
> >
> > >
> > > > +     bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> > > > +     if (!bpf_suggested_order)
> > > > +             goto out;
> > > > +
> > > > +     suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
> > >
> > > OK so now it's suggested order_S but we're invoking suggested order :) whaaatt?
> > > :)
> >
> > will change it.
>
> Thanks!
>
> >
> > >
> > > > +     if (highest_order(suggested_orders) > highest_order(orders))
> > > > +             suggested_orders = orders;
> > >
> > > Hmmm so the semantics are - whichever is the highest order wins?
> >
> > The maximum requested order is determined by the callsite. For example:
> > - PMD-mapped THP uses PMD_ORDER
> > - mTHP uses (PMD_ORDER - 1)
> >
> > We must respect this upper bound to avoid undefined behavior. So the
> > highest suggested order can't exceed the highest requested order.
>
> OK, please document this in a comment here.

will doc it.

>
> >
> > >
> > > I thought the idea was we'd hand control over to bpf if provided in effect?
> > >
> > > Definitely worth going over these semantics in the cover letter (and do forgive
> > > me if you have and I've missed! :)
> >
> > It is already in the cover letter:
> >
> >  * Return: Bitmask of suggested THP orders for allocation. The highest
> >  *         suggested order will not exceed the highest requested order
> >  *         in @orders.
>
> OK cool thanks, a comment here would be useful also.

will add it.

>
> >
> >
> > >
> > > > +
> > > > +out:
> > > > +     rcu_read_unlock();
> > > > +     return suggested_orders;
> > > > +}
> > > > +
> > > > +static bool bpf_thp_ops_is_valid_access(int off, int size,
> > > > +                                     enum bpf_access_type type,
> > > > +                                     const struct bpf_prog *prog,
> > > > +                                     struct bpf_insn_access_aux *info)
> > > > +{
> > > > +     return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> > > > +}
> > > > +
> > > > +static const struct bpf_func_proto *
> > > > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > > > +{
> > > > +     return bpf_base_func_proto(func_id, prog);
> > > > +}
> > > > +
> > > > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> > > > +     .get_func_proto = bpf_thp_get_func_proto,
> > > > +     .is_valid_access = bpf_thp_ops_is_valid_access,
> > > > +};
> > > > +
> > > > +static int bpf_thp_init(struct btf *btf)
> > > > +{
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static int bpf_thp_init_member(const struct btf_type *t,
> > > > +                            const struct btf_member *member,
> > > > +                            void *kdata, const void *udata)
> > > > +{
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> > > > +{
> > > > +     struct bpf_thp_ops *ops = kdata;
> > > > +
> > > > +     spin_lock(&thp_ops_lock);
> > > > +     if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > > > +                          &transparent_hugepage_flags)) {
> > > > +             spin_unlock(&thp_ops_lock);
> > > > +             return -EBUSY;
> > > > +     }
> > > > +     WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
> > > > +     rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order);
> > > > +     spin_unlock(&thp_ops_lock);
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> > > > +{
> > > > +     spin_lock(&thp_ops_lock);
> > > > +     clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> > > > +     WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> > > > +     rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> > > > +     spin_unlock(&thp_ops_lock);
> > > > +
> > > > +     synchronize_rcu();
> > > > +}
> > >
> > > I am a total beginner with BPF implementations so don't feel like I can say much
> > > intelligent about the above. But presumably fairly standard fare BPF-wise?
> >
> > This implementation is necessary to support BPF program updates.
>
> Ack.
>
> >
> > >
> > > Will perhaps try to dig deeper on another iteration :) as it's interesting to me.
> > >
> > > > +
> > > > +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> > > > +{
> > > > +     struct bpf_thp_ops *ops = kdata;
> > > > +     struct bpf_thp_ops *old = old_kdata;
> > > > +     int ret = 0;
> > > > +
> > > > +     if (!ops || !old)
> > > > +             return -EINVAL;
> > > > +
> > > > +     spin_lock(&thp_ops_lock);
> > > > +     /* The prog has already been removed. */
> > > > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
> > > > +             ret = -ENOENT;
> > > > +             goto out;
> > > > +     }
> > >
> > > OK so we gate things on this flag and it's global, got it.
> > >
> > > I see this is a hook, and I guess RCU-all-the-things is what BPF does which
> > > makes tonnes of sense.
> > >
> > > > +     WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
> > > > +     rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
> > > > +                         lockdep_is_held(&thp_ops_lock));
> > > > +
> > > > +out:
> > > > +     spin_unlock(&thp_ops_lock);
> > > > +     if (!ret)
> > > > +             synchronize_rcu();
> > > > +     return ret;
> > > > +}
> > > > +
> > > > +static int bpf_thp_validate(void *kdata)
> > > > +{
> > > > +     struct bpf_thp_ops *ops = kdata;
> > > > +
> > > > +     if (!ops->get_suggested_order) {
> > > > +             pr_err("bpf_thp: required ops isn't implemented\n");
> > > > +             return -EINVAL;
> > > > +     }
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > +                        u64 vma_flags, enum tva_type tva_flags, int orders)
> > > > +{
> > > > +     return orders;
> > > > +}
> > > > +
> > > > +static struct bpf_thp_ops __bpf_thp_ops = {
> > > > +     .get_suggested_order = suggested_order,
> > > > +};
> > >
> > > Can you explain to me what this stub stuff is for? This is more 'BPF impl 101'
> > > stuff sorry :)
> >
> > It is a CFI stub. cfi_stubs in BPF struct_ops are secure intermediary
> > functions that prevent the kernel from making direct, unsafe jumps to
> > BPF code. A new attached BPF program will run via this stub.
>
> Ack.
>
> >
> > >
> > > > +
> > > > +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> > > > +     .verifier_ops = &thp_bpf_verifier_ops,
> > > > +     .init = bpf_thp_init,
> > > > +     .init_member = bpf_thp_init_member,
> > > > +     .reg = bpf_thp_reg,
> > > > +     .unreg = bpf_thp_unreg,
> > > > +     .update = bpf_thp_update,
> > > > +     .validate = bpf_thp_validate,
> > > > +     .cfi_stubs = &__bpf_thp_ops,
> > > > +     .owner = THIS_MODULE,
> > > > +     .name = "bpf_thp_ops",
> > > > +};
> > > > +
> > > > +static int __init bpf_thp_ops_init(void)
> > > > +{
> > > > +     int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> > > > +
> > > > +     if (err)
> > > > +             pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> > > > +     return err;
> > > > +}
> > > > +late_initcall(bpf_thp_ops_init);
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index d89992b65acc..bd8f8f34ab3c 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > > >               return ret;
> > > >       khugepaged_enter_vma(vma, vma->vm_flags);
> > > >
> > > > +     /*
> > > > +      * This check must occur after khugepaged_enter_vma() because:
> > > > +      * 1. We may permit THP allocation via khugepaged
> > > > +      * 2. While simultaneously disallowing THP allocation
> > > > +      *    during page fault handling
> > > > +      */
> > > > +     if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
> > > > +                             BIT(PMD_ORDER))
> > >
> > > Hmmm so you return a bitmask of orders, but then you only allow this fault if
> > > the only order provided is PMD order? That seems strange. Can you explain?
> >
> > This is in do_huge_pmd_anonymous_page(), which can only accept a PMD
> > order; otherwise it might result in unexpected behavior.
>
> OK please document this in the comment.

will doc it.
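
Roughly like this (draft; the fallback return shown here is
illustrative, the actual diff may differ):

	/*
	 * This check must occur after khugepaged_enter_vma() because we
	 * may permit THP allocation via khugepaged while simultaneously
	 * disallowing THP allocation during page fault handling.
	 *
	 * do_huge_pmd_anonymous_page() can only install a PMD-sized THP,
	 * so fall back unless the suggested orders are exactly
	 * BIT(PMD_ORDER); any other value could result in unexpected
	 * behavior.
	 */
	if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT,
				BIT(PMD_ORDER)) != BIT(PMD_ORDER))
		return VM_FAULT_FALLBACK;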


--
Regards
Yafang


* Re: [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted
  2025-08-28 11:11       ` Lorenzo Stoakes
@ 2025-08-29  3:05         ` Yafang Shao
  2025-08-29 10:49           ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-29  3:05 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Thu, Aug 28, 2025 at 7:11 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Aug 28, 2025 at 02:12:12PM +0800, Yafang Shao wrote:
> > On Wed, Aug 27, 2025 at 11:46 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Tue, Aug 26, 2025 at 03:19:42PM +0800, Yafang Shao wrote:
> > > > Every VMA must have an associated mm_struct, and it is safe to access
> > >
> > > Err this isn't true? Pretty sure special VMAs don't have that set.
> >
> > I’m not aware of any VMA that doesn’t belong to an mm_struct. If there
> > is such a case, it would be helpful if you could point it out. In any
> > case, I’ll remove the VMA-related code in the next version since it’s
> > unnecessary.
>
> If you look at get_vma_name() in fs/proc/task_mmu.c you'll see:
>
>         if (!vma->vm_mm) {
>                 *name = "[vdso]";
>                 return;
>         }
>
> So a VDSO will have this condition.
>
> I did a quick drgn()/printk() test and didn't see any - maybe that's just my
> system - but
> in any case this appears to be a valid situation that can arise, presumably
> because it's a VMA somehow shared with multiple mm's or something truly god
> awful like that :)

Thanks for clarifying that.

-- 
Regards
Yafang


* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-28 10:42       ` Lorenzo Stoakes
@ 2025-08-29  3:09         ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-29  3:09 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc, Michal Hocko, Roman Gushchin, Shakeel Butt

On Thu, Aug 28, 2025 at 6:42 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Aug 28, 2025 at 02:57:03PM +0800, Yafang Shao wrote:
> > On Wed, Aug 27, 2025 at 11:34 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > +cc cgroup people, please do include them on this stuff.
> >
> > sure.
>
> Be good to cc on future respins for the whole series also! :) just so everybody
> is in the loop, thanks!

Thanks for the reminder. I'll include them in the next version.

-- 
Regards
Yafang


* Re: [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-28 10:51       ` Lorenzo Stoakes
@ 2025-08-29  3:15         ` Yafang Shao
  2025-08-29 10:42           ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-29  3:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrii Nakryiko, akpm, david, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, bpf, linux-mm, linux-doc

On Thu, Aug 28, 2025 at 6:51 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Aug 27, 2025 at 02:50:36PM -0700, Andrii Nakryiko wrote:
> > On Wed, Aug 27, 2025 at 8:48 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Tue, Aug 26, 2025 at 03:19:41PM +0800, Yafang Shao wrote:
> > > > We will utilize this new kfunc bpf_mm_get_task() to retrieve the
> > > > associated task_struct from the given @mm. The obtained task_struct must
> > > > be released by calling bpf_task_release() as a paired operation.
> > >
> > > You're basically describing the patch you're not saying why - yeah you're
> > > getting a task struct from an mm (only if CONFIG_MEMCG which you don't
> > > mention here), but not for what purpose you intend to use this?
> > >
> > > >
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > ---
> > > >  mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 34 insertions(+)
> > > >
> > > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > > > index b757e8f425fd..46b3bc96359e 100644
> > > > --- a/mm/bpf_thp.c
> > > > +++ b/mm/bpf_thp.c
> > > > @@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> > > >  #endif
> > > >  }
> > > >
> > > > +/**
> > > > + * bpf_mm_get_task - Get the task struct associated with a mm_struct.
> > > > + * @mm: The mm_struct to query
> > > > + *
> > > > + * The obtained task_struct must be released by calling bpf_task_release().
> > >
> > > Hmmm so now bpf programs can cause kernel bugs by keeping a reference around?
> >
> > BPF verifier will reject any program that cannot guarantee that
> > bpf_task_release() will always be called. So there shouldn't be any
> > problem here.
>
> Ah that's nice!
>
> What specifically here is enforcing that? Apologies again - BPF is new to me.

The KF_ACQUIRE and KF_RELEASE flags enforce resource management. If a
BPF helper function (e.g., bpf_mm_get_task()) is marked with
KF_ACQUIRE, the pointer it returns must be released by a corresponding
helper marked with KF_RELEASE (e.g., bpf_task_release()). The BPF
verifier will reject any program that fails to pair these calls
correctly.
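
For illustration, the pairing is declared when the kfunc is registered,
roughly like this (a sketch of the usual registration boilerplate;
bpf_task_release() itself is already registered with KF_RELEASE in the
core kernel):

	BTF_KFUNCS_START(bpf_thp_ids)
	/* Acquired task must be released; a NULL return is possible. */
	BTF_ID_FLAGS(func, bpf_mm_get_task, KF_ACQUIRE | KF_RET_NULL)
	BTF_KFUNCS_END(bpf_thp_ids)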

-- 
Regards
Yafang


* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-26  7:19 ` [PATCH v6 mm-new 01/10] mm: thp: add support for " Yafang Shao
  2025-08-27  2:57   ` kernel test robot
  2025-08-27 15:03   ` Lorenzo Stoakes
@ 2025-08-29  4:56   ` Barry Song
  2025-08-29  5:36     ` Yafang Shao
  2 siblings, 1 reply; 61+ messages in thread
From: Barry Song @ 2025-08-29  4:56 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, bpf, linux-mm, linux-doc

On Tue, Aug 26, 2025 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
[...]

> @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>         if (!zswap_never_enabled())
>                 goto fallback;
>
> +       suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> +                                              TVA_PAGEFAULT,
> +                                              BIT(PMD_ORDER) - 1);

Can we separate this case from the normal anonymous page faults below?
We’ve observed that swapping in large folios can lead to more swap
thrashing for some workloads, e.g. kernel build. Consequently, some
workloads might prefer swapping in smaller folios than those allocated
by alloc_anon_folio().

> +       if (!suggested_orders)
> +               goto fallback;
>         entry = pte_to_swp_entry(vmf->orig_pte);
>         /*
>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
>          * and suitable for swapping THP.
>          */
>         orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> -                                         BIT(PMD_ORDER) - 1);
> +                                         suggested_orders);
>         orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>         orders = thp_swap_suitable_orders(swp_offset(entry),
>                                           vmf->address, orders);
> @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       int order, suggested_orders;
>         unsigned long orders;
>         struct folio *folio;
>         unsigned long addr;
>         pte_t *pte;
>         gfp_t gfp;
> -       int order;
>
>         /*
>          * If uffd is active for the vma we need per-page fault fidelity to
> @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>         if (unlikely(userfaultfd_armed(vma)))
>                 goto fallback;
>
> +       suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> +                                              TVA_PAGEFAULT,
> +                                              BIT(PMD_ORDER) - 1);
> +       if (!suggested_orders)
> +               goto fallback;

Thanks
Barry


* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-29  4:56   ` Barry Song
@ 2025-08-29  5:36     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-08-29  5:36 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, bpf, linux-mm, linux-doc

On Fri, Aug 29, 2025 at 12:56 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Aug 26, 2025 at 3:43 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> [...]
>
> > @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >         if (!zswap_never_enabled())
> >                 goto fallback;
> >
> > +       suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> > +                                              TVA_PAGEFAULT,
> > +                                              BIT(PMD_ORDER) - 1);
>
> Can we separate this case from the normal anonymous page faults below?
> We’ve observed that swapping in large folios can lead to more swap
> thrashing for some workloads, e.g. kernel build. Consequently, some
> workloads might prefer swapping in smaller folios than those allocated
> by alloc_anon_folio().

Makes sense to me.
Perhaps we can add a new tva_type such as TVA_SWAP for this case.
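
A minimal sketch of what I mean (TVA_SWAP is hypothetical and not part
of this series yet):

	/* mm/memory.c, alloc_swap_folio(): pass a dedicated type so a BPF
	 * policy can choose smaller orders for swap-in than for normal
	 * anonymous faults.
	 */
	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
					       TVA_SWAP,
					       BIT(PMD_ORDER) - 1);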

-- 
Regards
Yafang


* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-29  3:01         ` Yafang Shao
@ 2025-08-29 10:42           ` Lorenzo Stoakes
  2025-08-31  3:11             ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-29 10:42 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Fri, Aug 29, 2025 at 11:01:59AM +0800, Yafang Shao wrote:
> On Thu, Aug 28, 2025 at 6:50 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote:
> > > > Also will mm ever != vma->vm_mm?
> > >
> > > No it can't. It can be guaranteed by the caller.
> >
> > In this case we don't need to pass mm separately then right?
>
> Right, we need to pass either @mm or @vma. However, there are cases
> where vma information is not available at certain call sites, such as
> in khugepaged. In those cases, we need to pass @mm instead.

Yeah... this is weird to me though: are you checking in _general_ what
khugepaged should use, or otherwise surely it's per-VMA?

Otherwise this bpf hook seems ill-suited for that, and we should have a
separate one for khugepaged surely?

I also hate that we're passing mm _just because of this one edge case_
while otherwise always passing vma->vm_mm - it's a confusing interface.

>
> >
> > >
> > > >
> > > > Are we hacking this for the sake of overloading what this does?
> > >
> > > The @vma is actually unneeded. I will remove it.
> >
> > Ah OK.
> >
> > I am still a little concerned about passing around a value reference to the VMA
> > flags though, esp as this type can + will change in future (not sure what that
> > means for BPF).
> >
> > We may go to e.g. a 128 bit bitmap there etc.
>
> As mentioned in another thread, we only need to determine whether the
> flag is VM_HUGEPAGE or VM_NOHUGEPAGE, so it can be simplified.

OK cool thanks. Maybe missed.

>
> >
> >
> > >
> > > >
> > > > Also if we're returning a bitmask of orders which you seem to be (not sure I
> > > > > like that tbh - I feel like we should simply provide one order but open for
> > > > > discussion) - shouldn't it return an unsigned long?
> > >
> > > We are indifferent to whether a single order or a bitmask is returned,
> > > as we only use order-0 and order-9. We have no use cases for
> > > middle-order pages, though this feature might be useful for other
> > > architectures or for some special use cases.
> >
> > Well surely we want to potentially specify a mTHP under certain circumstances
> > no?
>
> Perhaps there are use cases, but I haven’t found any use cases for
> this in our production environment. On the other hand, I can clearly
> see a risk that it could lead to more costly high-order allocations.

So why are we returning a bitmap then? Seems like we should just return a
single order in this case... I think you say below that you are open to
this?

>
> >
> > In any case I feel it's worth making any bitfield a system word size.

Also :>)

If we do move to returning a single order, should be unsigned int.

> >
> > >
> > > >
> > > > > +#else
> > > > > +static inline int
> > > > > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > > +                 u64 vma_flags, enum tva_type tva_flags, int orders)
> > > > > +{
> > > > > +     return orders;
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > >  static inline int highest_order(unsigned long orders)
> > > > >  {
> > > > >       return fls_long(orders) - 1;
> > > > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > > > index eb1946a70cff..d81c1228a21f 100644
> > > > > --- a/include/linux/khugepaged.h
> > > > > +++ b/include/linux/khugepaged.h
> > > > > @@ -4,6 +4,8 @@
> > > > >
> > > > >  #include <linux/mm.h>
> > > > >
> > > > > +#include <linux/huge_mm.h>
> > > > > +
> > > >
> > > > Hm this is iffy too. There's probably a reason we didn't include this before,
> > > > the headers can be so, so fragile. Let's be cautious...
> > >
> > > I will check.
> >
> > Thanks!
> >
> > >
> > > >
> > > > >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> > > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > >  extern struct attribute_group khugepaged_attr_group;
> > > > > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > > >
> > > > >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > > > >  {
> > > > > -     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> > > > > +     /*
> > > > > +      * THP allocation policy can be dynamically modified via BPF. Even if a
> > > > > +      * task was allowed to allocate THPs, BPF can decide whether its forked
> > > > > +      * child can allocate THPs.
> > > > > +      *
> > > > > +      * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > > > > +      */
> > > > > +     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> > > > > +             get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> > > >
> > > > Hmmm so there seems to be some kind of additional functionality you're providing
> > > > here kinda quietly, which is to allow the exact same interface to determine
> > > > whether we kick off khugepaged or not.
> > > >
> > > > Don't love that, I think we should be hugely specific about that.
> > > >
> > > > This bpf interface should literally be 'ok we're deciding what order we
> > > > want'. It feels like a bit of a gross overloading?
> > >
> > > This makes sense. I have no objection to reverting to returning a single order.
> >
> > OK but key point here is - we're now determining if a forked child can _not_
> > allocate THPs using this function.
> >
> > To me this should be a separate function rather than some _weird_ usage of this
> > same function.
>
> Perhaps a separate function is better.

Thanks!

>
> >
> > And generally at this point I think we should just drop this bit of code
> > honestly.
>
> MMF_VM_HUGEPAGE is set when the THP mode is "always" or "madvise". If
> it’s set, any forked child processes will inherit this flag. It is
> only cleared when the mm_struct is destroyed (please correct me if I’m
> wrong).

__mmput()
-> khugepaged_exit()
-> (if MMF_VM_HUGEPAGE set) __khugepaged_exit()
-> Clear flag once mm fully done with (afaict), dropping associated mm refcount.

^--- this does seem to be accurate indeed.

>
> However, when you switch the THP mode to "never", tasks that still
> have MMF_VM_HUGEPAGE remain on the khugepaged scan list. This isn’t an
> issue under the current global mode because khugepaged doesn’t run
> when THP is set to "never".
>
> The problem arises when we move from a global mode to a per-task mode.
> In that case, khugepaged may end up doing unnecessary work. For
> example, if the THP mode is "always", but some tasks are not allowed
> to allocate THP while still having MMF_VM_HUGEPAGE set, khugepaged
> will continue scanning them unnecessarily.

But this can change, right?

I really don't like the idea _at all_ of overriding this hook to do things
other than what it says it does.

It's 'set which order to use' except when it's this case then it's 'will we
do any work'.

This should be a separate callback or we should drop this and live with the
possible additional work.

>
> To avoid this, we should prevent setting this flag for child processes
> if they are not allowed to allocate THP in the first place. This way,
> khugepaged won’t waste cycles scanning them. While an alternative
> approach would be to set the flag at fork and later clear it for
> khugepaged, it’s clearly more efficient to avoid setting it from the
> start.

We also obviously should have a comment with all this context here.


> >
> > >
> > > >
> > > > > +     if (highest_order(suggested_orders) > highest_order(orders))
> > > > > +             suggested_orders = orders;
> > > >
> > > > Hmmm so the semantics are - whichever is the highest order wins?
> > >
> > > The maximum requested order is determined by the callsite. For example:
> > > - PMD-mapped THP uses PMD_ORDER
> > > - mTHP uses (PMD_ORDER - 1)
> > >
> > > We must respect this upper bound to avoid undefined behavior. So the
> > > highest suggested order can't exceed the highest requested order.
> >
> > OK, please document this in a comment here.
>
> will doc it.

Thanks!

>
> >
> > >
> > > >
> > > > I thought the idea was we'd hand control over to bpf if provided in effect?
> > > >
> > > > Definitely worth going over these semantics in the cover letter (and do forgive
> > > > me if you have and I've missed! :)
> > >
> > > It's already in the cover letter:
> > >
> > >  * Return: Bitmask of suggested THP orders for allocation. The highest
> > >  *         suggested order will not exceed the highest requested order
> > >  *         in @orders.
> >
> > OK cool thanks, a comment here would be useful also.
>
> will add it.

Thanks!

> > > >
> > > > > +
> > > > > +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> > > > > +     .verifier_ops = &thp_bpf_verifier_ops,
> > > > > +     .init = bpf_thp_init,
> > > > > +     .init_member = bpf_thp_init_member,
> > > > > +     .reg = bpf_thp_reg,
> > > > > +     .unreg = bpf_thp_unreg,
> > > > > +     .update = bpf_thp_update,
> > > > > +     .validate = bpf_thp_validate,
> > > > > +     .cfi_stubs = &__bpf_thp_ops,
> > > > > +     .owner = THIS_MODULE,
> > > > > +     .name = "bpf_thp_ops",
> > > > > +};
> > > > > +
> > > > > +static int __init bpf_thp_ops_init(void)
> > > > > +{
> > > > > +     int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> > > > > +
> > > > > +     if (err)
> > > > > +             pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> > > > > +     return err;
> > > > > +}
> > > > > +late_initcall(bpf_thp_ops_init);
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index d89992b65acc..bd8f8f34ab3c 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> > > > >               return ret;
> > > > >       khugepaged_enter_vma(vma, vma->vm_flags);
> > > > >
> > > > > +     /*
> > > > > +      * This check must occur after khugepaged_enter_vma() because:
> > > > > +      * 1. We may permit THP allocation via khugepaged
> > > > > +      * 2. While simultaneously disallowing THP allocation
> > > > > +      *    during page fault handling
> > > > > +      */
> > > > > +     if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
> > > > > +                             BIT(PMD_ORDER))
> > > >
> > > > Hmmm so you return a bitmask of orders, but then you only allow this fault if
> > > > the only order provided is PMD order? That seems strange. Can you explain?
> > >
> > > This is in do_huge_pmd_anonymous_page(), which can only accept a PMD
> > > order; otherwise it might result in unexpected behavior.
> >
> > OK please document this in the comment.
>
> will doc it.

Thanks!

>
>
> --
> Regards
> Yafang

Cheers, Lorenzo


* Re: [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-29  3:15         ` Yafang Shao
@ 2025-08-29 10:42           ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-29 10:42 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Andrii Nakryiko, akpm, david, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, bpf, linux-mm, linux-doc

On Fri, Aug 29, 2025 at 11:15:17AM +0800, Yafang Shao wrote:
> On Thu, Aug 28, 2025 at 6:51 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Aug 27, 2025 at 02:50:36PM -0700, Andrii Nakryiko wrote:
> > > On Wed, Aug 27, 2025 at 8:48 AM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Tue, Aug 26, 2025 at 03:19:41PM +0800, Yafang Shao wrote:
> > > > > We will utilize this new kfunc bpf_mm_get_task() to retrieve the
> > > > > associated task_struct from the given @mm. The obtained task_struct must
> > > > > be released by calling bpf_task_release() as a paired operation.
> > > >
> > > > You're basically describing the patch you're not saying why - yeah you're
> > > > getting a task struct from an mm (only if CONFIG_MEMCG which you don't
> > > > mention here), but not for what purpose you intend to use this?
> > > >
> > > > >
> > > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > > ---
> > > > >  mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 34 insertions(+)
> > > > >
> > > > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > > > > index b757e8f425fd..46b3bc96359e 100644
> > > > > --- a/mm/bpf_thp.c
> > > > > +++ b/mm/bpf_thp.c
> > > > > @@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> > > > >  #endif
> > > > >  }
> > > > >
> > > > > +/**
> > > > > + * bpf_mm_get_task - Get the task struct associated with a mm_struct.
> > > > > + * @mm: The mm_struct to query
> > > > > + *
> > > > > + * The obtained task_struct must be released by calling bpf_task_release().
> > > >
> > > > Hmmm so now bpf programs can cause kernel bugs by keeping a reference around?
> > >
> > > BPF verifier will reject any program that cannot guarantee that
> > > bpf_task_release() will always be called. So there shouldn't be any
> > > problem here.
> >
> > Ah that's nice!
> >
> > What specifically here is enforcing that? Apologies again - BPF is new to me.
>
> The KF_ACQUIRE and KF_RELEASE flags enforce resource management. If a
> BPF helper function (e.g., bpf_mm_get_task()) is marked with
> KF_ACQUIRE, the pointer it returns must be released by a corresponding
> helper marked with KF_RELEASE (e.g., bpf_task_release()). The BPF
> verifier will reject any program that fails to pair these calls
> correctly.

OK that's really really nice actually! :)

Thanks!

>
> --
> Regards
> Yafang

Cheers, Lorenzo


* Re: [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-28  6:47     ` Yafang Shao
@ 2025-08-29 10:43       ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-29 10:43 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Thu, Aug 28, 2025 at 02:47:34PM +0800, Yafang Shao wrote:
> On Wed, Aug 27, 2025 at 11:42 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Aug 26, 2025 at 03:19:41PM +0800, Yafang Shao wrote:
> > > We will utilize this new kfunc bpf_mm_get_task() to retrieve the
> > > associated task_struct from the given @mm. The obtained task_struct must
> > > be released by calling bpf_task_release() as a paired operation.
> >
> > You're basically describing the patch you're not saying why - yeah you're
> > getting a task struct from an mm (only if CONFIG_MEMCG which you don't
> > mention here), but not for what purpose you intend to use this?
>
> For example, we could retrieve task->comm or other attributes and make
> decisions based on that information. I’ll provide a clearer
> description in the next revision.

Thanks!

>
> >
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > ---
> > >  mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
> > >  1 file changed, 34 insertions(+)
> > >
> > > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > > index b757e8f425fd..46b3bc96359e 100644
> > > --- a/mm/bpf_thp.c
> > > +++ b/mm/bpf_thp.c
> > > @@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> > >  #endif
> > >  }
> > >
> > > +/**
> > > + * bpf_mm_get_task - Get the task struct associated with a mm_struct.
> > > + * @mm: The mm_struct to query
> > > + *
> > > + * The obtained task_struct must be released by calling bpf_task_release().
> >
> > Hmmm so now bpf programs can cause kernel bugs by keeping a reference around?
> >
> > This feels extremely dodgy, I don't like this at all.
> >
> > I thought the whole point of BPF was that this kind of thing couldn't possibly
> > happen?
> >
> > Or would this be a kernel bug?
> >
> > If a bpf program can lead to a refcount not being put, this is not
> > upstreamable surely?
>
> As explained by Andrii, the BPF verifier can protect it.

Yeah that's nice!

>
> >
> > > + *
> > > + * Return: The associated task_struct on success, or NULL on failure. Note that
> > > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > > + * NULL if CONFIG_MEMCG is not configured.
> > > + */
> > > +__bpf_kfunc struct task_struct *bpf_mm_get_task(struct mm_struct *mm)
> > > +{
> > > +#ifdef CONFIG_MEMCG
> > > +     struct task_struct *task;
> > > +
> > > +     if (!mm)
> > > +             return NULL;
> > > +     rcu_read_lock();
> > > +     task = rcu_dereference(mm->owner);
> >
> > > +     if (!task)
> > > +             goto out;
> > > +     if (!refcount_inc_not_zero(&task->rcu_users))
> > > +             goto out;
> > > +
> > > +     rcu_read_unlock();
> > > +     return task;
> > > +
> > > +out:
> > > +     rcu_read_unlock();
> > > +#endif
> >
> > This #ifdeffery is horrid, can we please just have separate functions instead of
> > inside the one? Thanks.
> >
> > > +     return NULL;
> >
> > So we can't tell the difference between this failing due to CONFIG_MEMCG
> > not being set (in which case it will _always_ fail) or we couldn't get a
> > task or we couldn't get a refcount on the task.
> >
> > Maybe this doesn't matter since perhaps we are only using this if
> > CONFIG_MEMCG but in that case why even expose this if !CONFIG_MEMCG?
> >
>
> As suggested by Andrii, I will remove this kfunc and mark mm->owner as
> BTF_TYPE_SAFE_TRUSTED_OR_NULL.

OK thanks!

>
> Thanks for your comments.

You're welcome :)

>
> --
> Regards
> Yafang

Cheers, Lorenzo


* Re: [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-28 16:00         ` Shakeel Butt
@ 2025-08-29 10:45           ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-29 10:45 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, bpf,
	linux-mm, linux-doc, Michal Hocko, Roman Gushchin

On Thu, Aug 28, 2025 at 09:00:15AM -0700, Shakeel Butt wrote:
> On Thu, Aug 28, 2025 at 11:40:16AM +0100, Lorenzo Stoakes wrote:
> > On Wed, Aug 27, 2025 at 01:50:18PM -0700, Shakeel Butt wrote:
> > > On Wed, Aug 27, 2025 at 04:34:48PM +0100, Lorenzo Stoakes wrote:
> > > > > +__bpf_kfunc_start_defs();
> > > > > +
> > > > > +/**
> > > > > + * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
> > > > > + * @mm: The mm_struct to query
> > > > > + *
> > > > > + * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
> > > > > + *
> > > > > + * Return: The associated mem_cgroup on success, or NULL on failure. Note that
> > > > > + * this function depends on CONFIG_MEMCG being enabled - it will always return
> > > > > + * NULL if CONFIG_MEMCG is not configured.
> > > >
> > > > What kind of locking is assumed here?
> > > >
> > > > Are we protected against mmdrop() clearing out the mm?
> > >
> > > No locking is needed. Just the valid mm object or NULL. Usually the
> > > underlying function (get_mem_cgroup_from_mm) is called in page fault
> > > context where the current is holding mm. Here the only requirement is
> > > that mm is valid either through explicit reference or the context.
> >
> > I mean this may be down to me being not so familiar with BPF, but my concern is
> > that we're handing _any_ mm here.
>
> It's not really any mm but rather the mm whose validity is ensured by
> the caller. I don't know the BPF internals but if I understand Andrii's
> response on the other email, the BPF verifier will make sure the BPF program
> is holding a valid mm on which it is calling this function. In non-BPF
> world, get_mem_cgroup_from_mm() assumes the caller is providing a valid
> mm.

OK cool. The verifier aspect of this is really nice... :)

>
> >
> > So presumably this could also be a remote mm?
>
> Which is fine as we already do this today, i.e. page fault on accessing
> memory of a remote process.

OK.

>
> >
> > If not then why are we accepting an mm parameter at all, when we could just grab
> > current->mm?
>
> Because current->mm might not be equal to the faulting mm as in the case
> of remote page fault.

Ack yeah as per above.

>
> >
> > If it's a remote mm, then we need to be absolutely sure that we won't UAF.
> >
> > I also feel we should talk about this in the kdoc, unless BPF always somehow
> > asserts these things to be the case + verifies them smoehow.
> >
>
> Yeah, some text on how the BPF verifier is making sure that the BPF program
> is handling a valid mm.

This would be nice indeed!


* Re: [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted
  2025-08-29  3:05         ` Yafang Shao
@ 2025-08-29 10:49           ` Lorenzo Stoakes
  2025-08-31  3:16             ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-08-29 10:49 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Fri, Aug 29, 2025 at 11:05:01AM +0800, Yafang Shao wrote:
> On Thu, Aug 28, 2025 at 7:11 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Aug 28, 2025 at 02:12:12PM +0800, Yafang Shao wrote:
> > > On Wed, Aug 27, 2025 at 11:46 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Tue, Aug 26, 2025 at 03:19:42PM +0800, Yafang Shao wrote:
> > > > > Every VMA must have an associated mm_struct, and it is safe to access
> > > >
> > > > Err this isn't true? Pretty sure special VMAs don't have that set.
> > >
> > > I’m not aware of any VMA that doesn’t belong to an mm_struct. If there
> > > is such a case, it would be helpful if you could point it out. In any
> > > case, I’ll remove the VMA-related code in the next version since it’s
> > > unnecessary.
> >
> > If you lok at get_vma_name() in fs/proc/task_mmu.c you'll see:
> >
> >         if (!vma->vm_mm) {
> >                 *name = "[vdso]";
> >                 return;
> >         }
> >
> > So a VDSO will have this condition.
> >
> > I did a quick drgn()/printk() test and didn't see any, but maybe my system - but
> > in any case this appears to be a valid situation that can arise, presumably
> > because it's a VMA somehow shared with multiple mm's or something truly god
> > awful like that :)
>
> Thanks for clarifying that.

No problem! These weird edge cases are... weird and hugely confusing. I should
document some of this somewhere, as it's at the moment more 'oh yeah I
remember...' than having to dig through to figure it out.

The "/dev/zero file-backed but actually anon if MAP_PRIVATE'd" is another fun
unique case.

>
> --
> Regards
> Yafang


* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-29 10:42           ` Lorenzo Stoakes
@ 2025-08-31  3:11             ` Yafang Shao
  2025-09-01 11:39               ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-31  3:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Fri, Aug 29, 2025 at 6:42 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Aug 29, 2025 at 11:01:59AM +0800, Yafang Shao wrote:
> > On Thu, Aug 28, 2025 at 6:50 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote:
> > > > > Also will mm ever != vma->vm_mm?
> > > >
> > > > No it can't. It can be guaranteed by the caller.
> > >
> > > In this case we don't need to pass mm separately then right?
> >
> > Right, we need to pass either @mm or @vma. However, there are cases
> > where vma information is not available at certain call sites, such as
> > in khugepaged. In those cases, we need to pass @mm instead.
>
> Yeah... this is weird to me though: are you checking in _general_ what
> khugepaged should use, or otherwise surely it's per-VMA?
>
> Otherwise this bpf hook seems ill-suited for that, and we should have a
> separate one for khugepaged surely?
>
> I also hate that we're passing mm _just because of this one edge case_
> while otherwise always passing vma->vm_mm - it's a confusing interface.

Makes sense.
I'll give some thought to how we can better handle this edge case.
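
One possible shape, just to illustrate (the name and signature are by
no means settled):

	/* mm-scoped decision for khugepaged, where no VMA is available;
	 * the existing hook would then always take vma->vm_mm.
	 */
	bool (*khugepaged_enabled)(struct mm_struct *mm);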

>
> >
> > >
> > > >
> > > > >
> > > > > Are we hacking this for the sake of overloading what this does?
> > > >
> > > > The @vma is actually unneeded. I will remove it.
> > >
> > > Ah OK.
> > >
> > > I am still a little concerned about passing around a value reference to the VMA
> > > flags though, esp as this type can + will change in future (not sure what that
> > > means for BPF).
> > >
> > > We may go to e.g. a 128 bit bitmap there etc.
> >
> > As mentioned in another thread, we only need to determine whether the
> > flag is VM_HUGEPAGE or VM_NOHUGEPAGE, so it can be simplified.
>
> OK cool thanks. Maybe missed.
>
> >
> > >
> > >
> > > >
> > > > >
> > > > > Also if we're returning a bitmask of orders which you seem to be (not sure I
> > > > > like that tbh - I feel like we shoudl simply provide one order but open for
> > > > > disucssion) - shouldn't it return an unsigned long?
> > > >
> > > > We are indifferent to whether a single order or a bitmask is returned,
> > > > as we only use order-0 and order-9. We have no use cases for
> > > > middle-order pages, though this feature might be useful for other
> > > > architectures or for some special use cases.
> > >
> > > Well surely we want to potentially specify a mTHP under certain circumstances
> > > no?
> >
> > Perhaps there are use cases, but I haven’t found any use cases for
> > this in our production environment. On the other hand, I can clearly
> > see a risk that it could lead to more costly high-order allocations.
>
> So why are we returning a bitmap then? Seems like we should just return a
> single order in this case... I think you say below that you are open to
> this?

will return a single order in the next version.
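
i.e. roughly like this (a sketch of the revised callback; exact naming
is TBD):

	/* Return a single suggested order rather than a bitmask; the
	 * returned order must not exceed the requested @order.
	 */
	unsigned int (*get_suggested_order)(struct mm_struct *mm,
					    struct vm_area_struct *vma__nullable,
					    enum tva_type tva_type,
					    unsigned int order);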

>
> >
> > >
> > > In any case I feel it's worth making any bitfield a system word size.
>
> Also :>)
>
> If we do move to returning a single order, should be unsigned int.

sure

>
> > >
> > > >
> > > > >
> > > > > > +#else
> > > > > > +static inline int
> > > > > > +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > > > > > +                 u64 vma_flags, enum tva_type tva_flags, int orders)
> > > > > > +{
> > > > > > +     return orders;
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > >  static inline int highest_order(unsigned long orders)
> > > > > >  {
> > > > > >       return fls_long(orders) - 1;
> > > > > > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > > > > > index eb1946a70cff..d81c1228a21f 100644
> > > > > > --- a/include/linux/khugepaged.h
> > > > > > +++ b/include/linux/khugepaged.h
> > > > > > @@ -4,6 +4,8 @@
> > > > > >
> > > > > >  #include <linux/mm.h>
> > > > > >
> > > > > > +#include <linux/huge_mm.h>
> > > > > > +
> > > > >
> > > > > Hm this is iffy too. There's probably a reason we didn't include this before,
> > > > > the headers can be so, so fragile. Let's be cautious...
> > > >
> > > > I will check.
> > >
> > > Thanks!
> > >
> > > >
> > > > >
> > > > > >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> > > > > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > > > >  extern struct attribute_group khugepaged_attr_group;
> > > > > > @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> > > > > >
> > > > > >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > > > > >  {
> > > > > > -     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> > > > > > +     /*
> > > > > > +      * THP allocation policy can be dynamically modified via BPF. Even if a
> > > > > > +      * task was allowed to allocate THPs, BPF can decide whether its forked
> > > > > > +      * child can allocate THPs.
> > > > > > +      *
> > > > > > +      * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > > > > > +      */
> > > > > > +     if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> > > > > > +             get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> > > > >
> > > > > Hmmm so there seems to be some kind of additional functionality you're providing
> > > > > here kinda quietly, which is to allow the exact same interface to determine
> > > > > whether we kick off khugepaged or not.
> > > > >
> > > > > Don't love that, I think we should be hugely specific about that.
> > > > >
> > > > > This bpf interface should literally be 'ok we're deciding what order we
> > > > > want'. It feels like a bit of a gross overloading?
> > > >
> > > > This makes sense. I have no objection to reverting to returning a single order.
> > >
> > > OK but key point here is - we're now determining if a forked child can _not_
> > > allocate THPs using this function.
> > >
> > > To me this should be a separate function rather than some _weird_ usage of this
> > > same function.
> >
> > Perhaps a separate function is better.
>
> Thanks!
>
> >
> > >
> > > And generally at this point I think we should just drop this bit of code
> > > honestly.
> >
> > MMF_VM_HUGEPAGE is set when the THP mode is "always" or "madvise". If
> > it’s set, any forked child processes will inherit this flag. It is
> > only cleared when the mm_struct is destroyed (please correct me if I’m
> > wrong).
>
> __mmput()
> -> khugepaged_exit()
> -> (if MMF_VM_HUGEPAGE set) __khugepaged_exit()
> -> Clear flag once mm fully done with (afaict), dropping associated mm refcount.
>
> ^--- this does seem to be accurate indeed.

Thanks for the explanation.

>
> >
> > However, when you switch the THP mode to "never", tasks that still
> > have MMF_VM_HUGEPAGE remain on the khugepaged scan list. This isn’t an
> > issue under the current global mode because khugepaged doesn’t run
> > when THP is set to "never".
> >
> > The problem arises when we move from a global mode to a per-task mode.
> > In that case, khugepaged may end up doing unnecessary work. For
> > example, if the THP mode is "always", but some tasks are not allowed
> > to allocate THP while still having MMF_VM_HUGEPAGE set, khugepaged
> > will continue scanning them unnecessarily.
>
> But this can change, right?
>
> I really don't like the idea _at all_ of overriding this hook to do things
> other than what it says it does.
>
> It's 'set which order to use' except when it's this case then it's 'will we
> do any work'.
>
> This should be a separate callback or we should drop this and live with the
> possible additional work.

Perhaps we could reuse the MMF_DISABLE_THP flag by introducing a new
BPF helper to set it when we want to disable THP for a specific task.
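
A minimal sketch of such a helper (hypothetical name; error handling
kept to the bare minimum):

	__bpf_kfunc int bpf_mm_disable_thp(struct mm_struct *mm)
	{
		if (!mm)
			return -EINVAL;
		/* Both khugepaged and the fault path honor this flag. */
		mm_flags_set(MMF_DISABLE_THP, mm);
		return 0;
	}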

Separately from this patchset, I realized we can optimize khugepaged
handling for the MMF_DISABLE_THP case with the following changes:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 15203ea7d007..e9964edcee29 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -402,6 +402,11 @@ void __init khugepaged_destroy(void)
        kmem_cache_destroy(mm_slot_cache);
 }

+static inline int hpage_collapse_test_disable(struct mm_struct *mm)
+{
+       return test_bit(MMF_DISABLE_THP, &mm->flags);
+}
+
 static inline int hpage_collapse_test_exit(struct mm_struct *mm)
 {
        return atomic_read(&mm->mm_users) == 0;
@@ -1448,6 +1453,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
                /* khugepaged_mm_lock actually not necessary for the below */
                mm_slot_free(mm_slot_cache, mm_slot);
                mmdrop(mm);
+       } else if (hpage_collapse_test_disable(mm)) {
+               hash_del(&slot->hash);
+               list_del(&slot->mm_node);
+               mm_flags_clear(MMF_VM_HUGEPAGE, mm);
+               mm_slot_free(mm_slot_cache, mm_slot);
        }
 }

Specifically, if MMF_DISABLE_THP is set, we should remove it from
mm_slot to prevent unnecessary khugepaged processing.

>
> >
> > To avoid this, we should prevent setting this flag for child processes
> > if they are not allowed to allocate THP in the first place. This way,
> > khugepaged won’t waste cycles scanning them. While an alternative
> > approach would be to set the flag at fork and later clear it for
> > khugepaged, it’s clearly more efficient to avoid setting it from the
> > start.
>
> We also obviously should have a comment with all this context here.

Understood. I'll give some thought to a better way of handling this.
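
As a starting point, maybe something along these lines in
khugepaged_fork() (draft wording):

	/*
	 * MMF_VM_HUGEPAGE is inherited on fork and is normally only
	 * cleared once the mm is destroyed (__mmput() ->
	 * khugepaged_exit()). With a per-task BPF policy, a child that
	 * is never allowed to allocate THP would otherwise sit on the
	 * khugepaged scan list and be scanned uselessly, so avoid
	 * setting the flag at fork in the first place rather than
	 * setting it and then clearing it later.
	 */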

-- 
Regards
Yafang


* Re: [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted
  2025-08-29 10:49           ` Lorenzo Stoakes
@ 2025-08-31  3:16             ` Yafang Shao
  2025-09-01 10:36               ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-08-31  3:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Fri, Aug 29, 2025 at 6:49 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Aug 29, 2025 at 11:05:01AM +0800, Yafang Shao wrote:
> > On Thu, Aug 28, 2025 at 7:11 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Thu, Aug 28, 2025 at 02:12:12PM +0800, Yafang Shao wrote:
> > > > On Wed, Aug 27, 2025 at 11:46 PM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > On Tue, Aug 26, 2025 at 03:19:42PM +0800, Yafang Shao wrote:
> > > > > > Every VMA must have an associated mm_struct, and it is safe to access
> > > > >
> > > > > Err this isn't true? Pretty sure special VMAs don't have that set.
> > > >
> > > > I’m not aware of any VMA that doesn’t belong to an mm_struct. If there
> > > > is such a case, it would be helpful if you could point it out. In any
> > > > case, I’ll remove the VMA-related code in the next version since it’s
> > > > unnecessary.
> > >
> > > If you look at get_vma_name() in fs/proc/task_mmu.c you'll see:
> > >
> > >         if (!vma->vm_mm) {
> > >                 *name = "[vdso]";
> > >                 return;
> > >         }
> > >
> > > So a VDSO will have this condition.
> > >
> > > I did a quick drgn()/printk() test and didn't see any - maybe that's just my
> > > system - but
> > > in any case this appears to be a valid situation that can arise, presumably
> > > because it's a VMA somehow shared with multiple mm's or something truly god
> > > awful like that :)
> >
> > Thanks for clarifying that.
>
> No problem! These weird edge cases are... weird and hugely confusing. I should
> document some of this somewhere, as it's at the moment more 'oh yeah I
> remember...' then having to dig through to figure it out.
>
> The "/dev/zero file-backed but actually anon if MAP_PRIVATE'd" is another fun
> unique case.

It would be immensely helpful if you could document these cases. We
truly appreciate your contribution and the time you've invested in
this.

-- 
Regards
Yafang


* Re: [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted
  2025-08-31  3:16             ` Yafang Shao
@ 2025-09-01 10:36               ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-01 10:36 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Sun, Aug 31, 2025 at 11:16:27AM +0800, Yafang Shao wrote:
> On Fri, Aug 29, 2025 at 6:49 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Aug 29, 2025 at 11:05:01AM +0800, Yafang Shao wrote:
> > > On Thu, Aug 28, 2025 at 7:11 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Thu, Aug 28, 2025 at 02:12:12PM +0800, Yafang Shao wrote:
> > > > > On Wed, Aug 27, 2025 at 11:46 PM Lorenzo Stoakes
> > > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > > >
> > > > > > On Tue, Aug 26, 2025 at 03:19:42PM +0800, Yafang Shao wrote:
> > > > > > > Every VMA must have an associated mm_struct, and it is safe to access
> > > > > >
> > > > > > Err this isn't true? Pretty sure special VMAs don't have that set.
> > > > >
> > > > > I’m not aware of any VMA that doesn’t belong to an mm_struct. If there
> > > > > is such a case, it would be helpful if you could point it out. In any
> > > > > case, I’ll remove the VMA-related code in the next version since it’s
> > > > > unnecessary.
> > > >
> > > > If you look at get_vma_name() in fs/proc/task_mmu.c you'll see:
> > > >
> > > >         if (!vma->vm_mm) {
> > > >                 *name = "[vdso]";
> > > >                 return;
> > > >         }
> > > >
> > > > So a VDSO will have this condition.
> > > >
> > > > I did a quick drgn()/printk() test and didn't see any - maybe that's just my
> > > > system - but
> > > > in any case this appears to be a valid situation that can arise, presumably
> > > > because it's a VMA somehow shared with multiple mm's or something truly god
> > > > awful like that :)
> > >
> > > Thanks for clarifying that.
> >
> > No problem! These weird edge cases are... weird and hugely confusing. I should
> > document some of this somewhere, as it's at the moment more 'oh yeah I
> > remember...' than having to dig through to figure it out.
> >
> > The "/dev/zero file-backed but actually anon if MAP_PRIVATE'd" is another fun
> > unique case.
>
> It would be immensely helpful if you could document these cases. We
> truly appreciate your contribution and the time you've invested in
> this.

Sure I will add to my TODOs :) I agree it's not great that we have these odd
edge cases and do not document them clearly enough.

Will get to it :)
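
In the meantime, the practical takeaway for this series is that a VMA's
vm_mm must be treated as nullable. A minimal sketch of the defensive
pattern (bpf_thp_suggested_order() is a made-up wrapper, not an existing
function):

        /*
         * Sketch only: special mappings such as the vDSO can have
         * vma->vm_mm == NULL, so never dereference it unconditionally.
         */
        static inline int bpf_thp_suggested_order(struct vm_area_struct *vma,
                                                  int orders)
        {
                struct mm_struct *mm = vma->vm_mm;

                if (!mm)        /* e.g. a vDSO-style special mapping */
                        return 0;
                /* ... consult the registered BPF policy against @mm ... */
                return orders;
        }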

>
> --
> Regards
> Yafang

Thanks, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-08-31  3:11             ` Yafang Shao
@ 2025-09-01 11:39               ` Lorenzo Stoakes
  2025-09-02  2:48                 ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-01 11:39 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Sun, Aug 31, 2025 at 11:11:34AM +0800, Yafang Shao wrote:
> On Fri, Aug 29, 2025 at 6:42 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Fri, Aug 29, 2025 at 11:01:59AM +0800, Yafang Shao wrote:
> > > On Thu, Aug 28, 2025 at 6:50 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > > >
> > > > On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote:
> > > > > > Also will mm ever != vma->vm_mm?
> > > > >
> > > > > No, it can't; the caller guarantees that.
> > > >
> > > > In this case we don't need to pass mm separately then right?
> > >
> > > Right, we need to pass either @mm or @vma. However, VMA information
> > > is not available at certain call sites, such as in khugepaged. In
> > > those cases, we need to pass @mm instead.
> >
> > Yeah... this is weird to me though, are you checking in _general_ what
> > khugepaged should use, or otherwise surely it's per-VMA?
> >
> > Otherwise this bpf hook seems ill-suited for that, and we should have a
> > separate one for khugepaged surely?
> >
> > I also hate that we're passing mm _just because of this one edge case_
> > while otherwise always passing vma->vm_mm; it's a confusing interface.
>
> Makes sense.
> I'll give some thought to how we can better handle this edge case.

Thanks!

> > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > > Also if we're returning a bitmask of orders which you seem to be (not sure I
> > > > > > like that tbh - I feel like we should simply provide one order but open for
> > > > > > discussion) - shouldn't it return an unsigned long?
> > > > >
> > > > > We are indifferent to whether a single order or a bitmask is returned,
> > > > > as we only use order-0 and order-9. We have no use cases for
> > > > > middle-order pages, though this feature might be useful for other
> > > > > architectures or for some special use cases.
> > > >
> > > > Well surely we want to potentially specify a mTHP under certain circumstances
> > > > no?
> > >
> > > Perhaps there are use cases, but I haven’t found any for this in our
> > > production environment. On the other hand, I can clearly
> > > see a risk that it could lead to more costly high-order allocations.
> >
> > So why are we returning a bitmap then? Seems like we should just return a
> > single order in this case... I think you say below that you are open to
> > this?
>
> Will return a single order in the next version.

Thanks

>
> >
> > >
> > > >
> > > > In any case I feel it's worth making any bitfield a system word size.
> >
> > Also :>)
> >
> > If we do move to returning a single order, should be unsigned int.
>
> sure

Thanks!
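
So roughly something like this (a sketch of the agreed direction only;
bpf_thp_get_suggested_order() is a hypothetical name and the parameter
list is still being reworked elsewhere in this thread):

        /*
         * Sketch: the hook returns one suggested order as an unsigned
         * int, with 0 meaning "do not use THP", instead of a bitmask
         * of orders.
         */
        static unsigned int thp_policy_order(struct vm_area_struct *vma,
                                             enum tva_type type,
                                             unsigned int highest_order)
        {
                unsigned int order;

                order = bpf_thp_get_suggested_order(vma, type, highest_order);
                /* Never suggest more than the caller asked for. */
                return min(order, highest_order);
        }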

> > >
> > > >
> > > > And generally at this point I think we should just drop this bit of code
> > > > honestly.
> > >
> > > MMF_VM_HUGEPAGE is set when the THP mode is "always" or "madvise". If
> > > it’s set, any forked child processes will inherit this flag. It is
> > > only cleared when the mm_struct is destroyed (please correct me if I’m
> > > wrong).
> >
> > __mmput()
> > -> khugepaged_exit()
> > -> (if MMF_VM_HUGEPAGE set) __khugepaged_exit()
> > -> Clear flag once mm fully done with (afaict), dropping associated mm refcount.
> >
> > ^--- this does seem to be accurate indeed.
>
> Thanks for the explanation.

No problem, this was more 'Lorenzo's thought process' :P

>
> >
> > >
> > > However, when you switch the THP mode to "never", tasks that still
> > > have MMF_VM_HUGEPAGE remain on the khugepaged scan list. This isn’t an
> > > issue under the current global mode because khugepaged doesn’t run
> > > when THP is set to "never".
> > >
> > > The problem arises when we move from a global mode to a per-task mode.
> > > In that case, khugepaged may end up doing unnecessary work. For
> > > example, if the THP mode is "always", but some tasks are not allowed
> > > to allocate THP while still having MMF_VM_HUGEPAGE set, khugepaged
> > > will continue scanning them unnecessarily.
> >
> > But this can change right?
> >
> > I really don't like the idea _at all_ of overriding this hook to do things
> > other than what it says it does.
> >
> > It's 'set which order to use', except in this case it becomes 'will we
> > do any work'.
> >
> > This should be a separate callback or we should drop this and live with the
> > possible additional work.
>
> Perhaps we could reuse the MMF_DISABLE_THP flag by introducing a new
> BPF helper to set it when we want to disable THP for a specific task.

Interesting, yeah perhaps that could work, as long as we're in a sensible
context to be able to toggle this bit.
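
Something along these lines, say (a sketch only: the kfunc name is
invented, and mm_flags_set() is assumed as the counterpart of the
mm_flags_clear() in your diff below):

        /*
         * Hypothetical kfunc: let a BPF program opt a whole mm out of
         * THP by setting MMF_DISABLE_THP. Only safe from a context
         * that is allowed to modify mm state, e.g. a fork-time hook.
         */
        __bpf_kfunc int bpf_mm_disable_thp(struct mm_struct *mm)
        {
                if (!mm)
                        return -EINVAL;
                mm_flags_set(MMF_DISABLE_THP, mm);
                return 0;
        }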

>
> Separately from this patchset, I realized we can optimize khugepaged
> handling for the MMF_DISABLE_THP case with the following changes:
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 15203ea7d007..e9964edcee29 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -402,6 +402,11 @@ void __init khugepaged_destroy(void)
>         kmem_cache_destroy(mm_slot_cache);
>  }
>
> +static inline int hpage_collapse_test_disable(struct mm_struct *mm)
> +{
> +       return test_bit(MMF_DISABLE_THP, &mm->flags);
> +}
> +
>  static inline int hpage_collapse_test_exit(struct mm_struct *mm)
>  {
>         return atomic_read(&mm->mm_users) == 0;
> @@ -1448,6 +1453,11 @@ static void collect_mm_slot(struct
> khugepaged_mm_slot *mm_slot)
>                 /* khugepaged_mm_lock actually not necessary for the below */
>                 mm_slot_free(mm_slot_cache, mm_slot);
>                 mmdrop(mm);
> +       } else if (hpage_collapse_test_disable(mm)) {
> +               hash_del(&slot->hash);
> +               list_del(&slot->mm_node);
> +               mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> +               mm_slot_free(mm_slot_cache, mm_slot);
>         }
>  }
>
> Specifically, if MMF_DISABLE_THP is set, we should remove it from
> mm_slot to prevent unnecessary khugepaged processing.

Ohhh interesting, perhaps send as separate patch?

>
> >
> > >
> > > To avoid this, we should prevent setting this flag for child processes
> > > if they are not allowed to allocate THP in the first place. This way,
> > > khugepaged won’t waste cycles scanning them. While an alternative
> > > approach would be to set the flag at fork and later clear it for
> > > khugepaged, it’s clearly more efficient to avoid setting it from the
> > > start.
> >
> > We also obviously should have a comment with all this context here.
>
> Understood. I'll give some thought to a better way of handling this.

Thanks!

>
> --
> Regards
> Yafang

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-09-01 11:39               ` Lorenzo Stoakes
@ 2025-09-02  2:48                 ` Yafang Shao
  2025-09-02  7:50                   ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-02  2:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Mon, Sep 1, 2025 at 7:39 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Sun, Aug 31, 2025 at 11:11:34AM +0800, Yafang Shao wrote:
> > On Fri, Aug 29, 2025 at 6:42 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Fri, Aug 29, 2025 at 11:01:59AM +0800, Yafang Shao wrote:
> > > > On Thu, Aug 28, 2025 at 6:50 PM Lorenzo Stoakes
> > > > <lorenzo.stoakes@oracle.com> wrote:
> > > > >
> > > > > On Thu, Aug 28, 2025 at 01:54:39PM +0800, Yafang Shao wrote:
> > > > > > > Also will mm ever != vma->vm_mm?
> > > > > >
> > > > > > No, it can't; the caller guarantees that.
> > > > >
> > > > > In this case we don't need to pass mm separately then right?
> > > >
> > > > Right, we need to pass either @mm or @vma. However, VMA information
> > > > is not available at certain call sites, such as in khugepaged. In
> > > > those cases, we need to pass @mm instead.
> > >
> > > Yeah... this is weird to me though, are you checking in _general_ what
> > > khugepaged should use, or otherwise surely it's per-VMA?
> > >
> > > Otherwise this bpf hook seems ill-suited for that, and we should have a
> > > separate one for khugepaged surely?
> > >
> > > I also hate that we're passing mm _just because of this one edge case_
> > > while otherwise always passing vma->vm_mm; it's a confusing interface.
> >
> > Makes sense.
> > I'll give some thought to how we can better handle this edge case.
>
> Thanks!
>
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > Also if we're returning a bitmask of orders which you seem to be (not sure I
> > > > > > > like that tbh - I feel like we should simply provide one order but open for
> > > > > > > discussion) - shouldn't it return an unsigned long?
> > > > > >
> > > > > > We are indifferent to whether a single order or a bitmask is returned,
> > > > > > as we only use order-0 and order-9. We have no use cases for
> > > > > > middle-order pages, though this feature might be useful for other
> > > > > > architectures or for some special use cases.
> > > > >
> > > > > Well surely we want to potentially specify a mTHP under certain circumstances
> > > > > no?
> > > >
> > > > Perhaps there are use cases, but I haven’t found any for this in our
> > > > production environment. On the other hand, I can clearly
> > > > see a risk that it could lead to more costly high-order allocations.
> > >
> > > So why are we returning a bitmap then? Seems like we should just return a
> > > single order in this case... I think you say below that you are open to
> > > this?
> >
> > Will return a single order in the next version.
>
> Thanks
>
> >
> > >
> > > >
> > > > >
> > > > > In any case I feel it's worth making any bitfield a system word size.
> > >
> > > Also :>)
> > >
> > > If we do move to returning a single order, should be unsigned int.
> >
> > sure
>
> Thanks!
>
> > > >
> > > > >
> > > > > And generally at this point I think we should just drop this bit of code
> > > > > honestly.
> > > >
> > > > MMF_VM_HUGEPAGE is set when the THP mode is "always" or "madvise". If
> > > > it’s set, any forked child processes will inherit this flag. It is
> > > > only cleared when the mm_struct is destroyed (please correct me if I’m
> > > > wrong).
> > >
> > > __mmput()
> > > -> khugepaged_exit()
> > > -> (if MMF_VM_HUGEPAGE set) __khugepaged_exit()
> > > -> Clear flag once mm fully done with (afaict), dropping associated mm refcount.
> > >
> > > ^--- this does seem to be accurate indeed.
> >
> > Thanks for the explanation.
>
> No problem, this was more 'Lorenzo's thought process' :P
>
> >
> > >
> > > >
> > > > However, when you switch the THP mode to "never", tasks that still
> > > > have MMF_VM_HUGEPAGE remain on the khugepaged scan list. This isn’t an
> > > > issue under the current global mode because khugepaged doesn’t run
> > > > when THP is set to "never".
> > > >
> > > > The problem arises when we move from a global mode to a per-task mode.
> > > > In that case, khugepaged may end up doing unnecessary work. For
> > > > example, if the THP mode is "always", but some tasks are not allowed
> > > > to allocate THP while still having MMF_VM_HUGEPAGE set, khugepaged
> > > > will continue scanning them unnecessarily.
> > >
> > > But this can change right?
> > >
> > > I really don't like the idea _at all_ of overriding this hook to do things
> > > other than what it says it does.
> > >
> > > It's 'set which order to use', except in this case it becomes 'will we
> > > do any work'.
> > >
> > > This should be a separate callback or we should drop this and live with the
> > > possible additional work.
> >
> > Perhaps we could reuse the MMF_DISABLE_THP flag by introducing a new
> > BPF helper to set it when we want to disable THP for a specific task.
>
> Interesting, yeah perhaps that could work, as long as we're in a sensible
> context to be able to toggle this bit.

Right, we can't set the mm->flags arbitrarily.
Perhaps we should add a generic BPF hook in dup_mmap().

diff --git a/mm/mmap.c b/mm/mmap.c
index 7a057e0e8da9..1b60bdb08de1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1843,6 +1843,8 @@ __latent_entropy int dup_mmap(struct mm_struct
*mm, struct mm_struct *oldmm)
 loop_out:
        vma_iter_free(&vmi);
        if (!retval) {
+               /* Allow a BPF program to modify the new mm_struct in fork. */
+               bpf_hook_mm_fork(mm, oldmm);
                mt_set_in_rcu(vmi.mas.tree);
                ksm_fork(mm, oldmm);
                khugepaged_fork(mm, oldmm);

This provides a mechanism for BPF programs to configure the new
mm_struct on demand, acting as a modern, flexible replacement for
prctl() ;-)
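
On the BPF side that could look roughly like this (pure sketch: the
attachment point and bpf_mm_disable_thp() are the hypothetical pieces
discussed above, not an existing API):

        /* SPDX-License-Identifier: GPL-2.0 */
        #include <vmlinux.h>
        #include <bpf/bpf_tracing.h>

        char _license[] SEC("license") = "GPL";

        /*
         * Sketch: attach to the proposed fork-time hook and disable
         * THP for the new child mm, inheriting a per-workload policy.
         */
        SEC("struct_ops/mm_fork")       /* attachment point illustrative */
        int BPF_PROG(thp_on_fork, struct mm_struct *mm,
                     struct mm_struct *oldmm)
        {
                bpf_mm_disable_thp(mm); /* hypothetical kfunc */
                return 0;
        }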

>
> >
> > Separately from this patchset, I realized we can optimize khugepaged
> > handling for the MMF_DISABLE_THP case with the following changes:
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 15203ea7d007..e9964edcee29 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -402,6 +402,11 @@ void __init khugepaged_destroy(void)
> >         kmem_cache_destroy(mm_slot_cache);
> >  }
> >
> > +static inline int hpage_collapse_test_disable(struct mm_struct *mm)
> > +{
> > +       return test_bit(MMF_DISABLE_THP, &mm->flags);
> > +}
> > +
> >  static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> >  {
> >         return atomic_read(&mm->mm_users) == 0;
> > @@ -1448,6 +1453,11 @@ static void collect_mm_slot(struct
> > khugepaged_mm_slot *mm_slot)
> >                 /* khugepaged_mm_lock actually not necessary for the below */
> >                 mm_slot_free(mm_slot_cache, mm_slot);
> >                 mmdrop(mm);
> > +       } else if (hpage_collapse_test_disable(mm)) {
> > +               hash_del(&slot->hash);
> > +               list_del(&slot->mm_node);
> > +               mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> > +               mm_slot_free(mm_slot_cache, mm_slot);
> >         }
> >  }
> >
> > Specifically, if MMF_DISABLE_THP is set, we should remove it from
> > mm_slot to prevent unnecessary khugepaged processing.
>
> Ohhh interesting, perhaps send as separate patch?

sure, I will send it separately.

-- 
Regards
Yafang


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-09-02  2:48                 ` Yafang Shao
@ 2025-09-02  7:50                   ` Lorenzo Stoakes
  2025-09-03  2:10                     ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-02  7:50 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Tue, Sep 02, 2025 at 10:48:47AM +0800, Yafang Shao wrote:
> > >
> > > >
> > > > >
> > > > > However, when you switch the THP mode to "never", tasks that still
> > > > > have MMF_VM_HUGEPAGE remain on the khugepaged scan list. This isn’t an
> > > > > issue under the current global mode because khugepaged doesn’t run
> > > > > when THP is set to "never".
> > > > >
> > > > > The problem arises when we move from a global mode to a per-task mode.
> > > > > In that case, khugepaged may end up doing unnecessary work. For
> > > > > example, if the THP mode is "always", but some tasks are not allowed
> > > > > to allocate THP while still having MMF_VM_HUGEPAGE set, khugepaged
> > > > > will continue scanning them unnecessarily.
> > > >
> > > > But this can change right?
> > > >
> > > > I really don't like the idea _at all_ of overriding this hook to do things
> > > > other than what it says it does.
> > > >
> > > > It's 'set which order to use', except in this case it becomes 'will we
> > > > do any work'.
> > > >
> > > > This should be a separate callback or we should drop this and live with the
> > > > possible additional work.
> > >
> > > Perhaps we could reuse the MMF_DISABLE_THP flag by introducing a new
> > > BPF helper to set it when we want to disable THP for a specific task.
> >
> > Interesting, yeah perhaps that could work, as long as we're in a sensible
> > context to be able to toggle this bit.
>
> Right, we can't set the mm->flags arbitrarily.
> Perhaps we should add a generic BPF hook in dup_mmap().
>

Yeah perhaps that could be a way forward :)

> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7a057e0e8da9..1b60bdb08de1 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1843,6 +1843,8 @@ __latent_entropy int dup_mmap(struct mm_struct
> *mm, struct mm_struct *oldmm)
>  loop_out:
>         vma_iter_free(&vmi);
>         if (!retval) {
> +               /* Allow a BPF program to modify the new mm_struct in fork. */
> +               bpf_hook_mm_fork(mm, oldmm);
>                 mt_set_in_rcu(vmi.mas.tree);
>                 ksm_fork(mm, oldmm);
>                 khugepaged_fork(mm, oldmm);
>
> This provides a mechanism for BPF programs to configure the new
> mm_struct on demand, acting as a modern, flexible replacement for
> prctl() ;-)

Hahaha that's obviously very appealing to me :)))

>
> >
> > >
> > > Separately from this patchset, I realized we can optimize khugepaged
> > > handling for the MMF_DISABLE_THP case with the following changes:
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 15203ea7d007..e9964edcee29 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -402,6 +402,11 @@ void __init khugepaged_destroy(void)
> > >         kmem_cache_destroy(mm_slot_cache);
> > >  }
> > >
> > > +static inline int hpage_collapse_test_disable(struct mm_struct *mm)
> > > +{
> > > +       return test_bit(MMF_DISABLE_THP, &mm->flags);
> > > +}
> > > +
> > >  static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> > >  {
> > >         return atomic_read(&mm->mm_users) == 0;
> > > @@ -1448,6 +1453,11 @@ static void collect_mm_slot(struct
> > > khugepaged_mm_slot *mm_slot)
> > >                 /* khugepaged_mm_lock actually not necessary for the below */
> > >                 mm_slot_free(mm_slot_cache, mm_slot);
> > >                 mmdrop(mm);
> > > +       } else if (hpage_collapse_test_disable(mm)) {
> > > +               hash_del(&slot->hash);
> > > +               list_del(&slot->mm_node);
> > > +               mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> > > +               mm_slot_free(mm_slot_cache, mm_slot);
> > >         }
> > >  }
> > >
> > > Specifically, if MMF_DISABLE_THP is set, we should remove it from
> > > mm_slot to prevent unnecessary khugepaged processing.
> >
> > Ohhh interesting, perhaps send as separate patch?
>
> sure, I will send it separately.

Thanks!

>
> --
> Regards
> Yafang

And overall - cheers for being an ABSOLUTE DELIGHT on review :) it's much
appreciated. I shall buy you a beer (or whatever is your preferred
beverage) at the next conference we are both at :)

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v6 mm-new 01/10] mm: thp: add support for BPF based THP order selection
  2025-09-02  7:50                   ` Lorenzo Stoakes
@ 2025-09-03  2:10                     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-03  2:10 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, bpf, linux-mm,
	linux-doc

On Tue, Sep 2, 2025 at 3:50 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Sep 02, 2025 at 10:48:47AM +0800, Yafang Shao wrote:
> > > >
> > > > >
> > > > > >
> > > > > > However, when you switch the THP mode to "never", tasks that still
> > > > > > have MMF_VM_HUGEPAGE remain on the khugepaged scan list. This isn’t an
> > > > > > issue under the current global mode because khugepaged doesn’t run
> > > > > > when THP is set to "never".
> > > > > >
> > > > > > The problem arises when we move from a global mode to a per-task mode.
> > > > > > In that case, khugepaged may end up doing unnecessary work. For
> > > > > > example, if the THP mode is "always", but some tasks are not allowed
> > > > > > to allocate THP while still having MMF_VM_HUGEPAGE set, khugepaged
> > > > > > will continue scanning them unnecessarily.
> > > > >
> > > > > But this can change right?
> > > > >
> > > > > I really don't like the idea _at all_ of overriding this hook to do things
> > > > > other than what it says it does.
> > > > >
> > > > > It's 'set which order to use', except in this case it becomes 'will we
> > > > > do any work'.
> > > > >
> > > > > This should be a separate callback or we should drop this and live with the
> > > > > possible additional work.
> > > >
> > > > Perhaps we could reuse the MMF_DISABLE_THP flag by introducing a new
> > > > BPF helper to set it when we want to disable THP for a specific task.
> > >
> > > Interesting, yeah perhaps that could work, as long as we're in a sensible
> > > context to be able to toggle this bit.
> >
> > Right, we can't set the mm->flags arbitrarily.
> > Perhaps we should add a generic BPF hook in dup_mmap().
> >
>
> Yeah perhaps that could be a way forward :)
>
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 7a057e0e8da9..1b60bdb08de1 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -1843,6 +1843,8 @@ __latent_entropy int dup_mmap(struct mm_struct
> > *mm, struct mm_struct *oldmm)
> >  loop_out:
> >         vma_iter_free(&vmi);
> >         if (!retval) {
> > +               /* Allow a BPF program to modify the new mm_struct in fork. */
> > +               bpf_hook_mm_fork(mm, oldmm);
> >                 mt_set_in_rcu(vmi.mas.tree);
> >                 ksm_fork(mm, oldmm);
> >                 khugepaged_fork(mm, oldmm);
> >
> > This provides a mechanism for BPF programs to configure the new
> > mm_struct on demand, acting as a modern, flexible replacement for
> > prctl() ;-)
>
> Hahaha that's obviously very appealing to me :)))
>
> >
> > >
> > > >
> > > > Separately from this patchset, I realized we can optimize khugepaged
> > > > handling for the MMF_DISABLE_THP case with the following changes:
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 15203ea7d007..e9964edcee29 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -402,6 +402,11 @@ void __init khugepaged_destroy(void)
> > > >         kmem_cache_destroy(mm_slot_cache);
> > > >  }
> > > >
> > > > +static inline int hpage_collapse_test_disable(struct mm_struct *mm)
> > > > +{
> > > > +       return test_bit(MMF_DISABLE_THP, &mm->flags);
> > > > +}
> > > > +
> > > >  static inline int hpage_collapse_test_exit(struct mm_struct *mm)
> > > >  {
> > > >         return atomic_read(&mm->mm_users) == 0;
> > > > @@ -1448,6 +1453,11 @@ static void collect_mm_slot(struct
> > > > khugepaged_mm_slot *mm_slot)
> > > >                 /* khugepaged_mm_lock actually not necessary for the below */
> > > >                 mm_slot_free(mm_slot_cache, mm_slot);
> > > >                 mmdrop(mm);
> > > > +       } else if (hpage_collapse_test_disable(mm)) {
> > > > +               hash_del(&slot->hash);
> > > > +               list_del(&slot->mm_node);
> > > > +               mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> > > > +               mm_slot_free(mm_slot_cache, mm_slot);
> > > >         }
> > > >  }
> > > >
> > > > Specifically, if MMF_DISABLE_THP is set, we should remove it from
> > > > mm_slot to prevent unnecessary khugepaged processing.
> > >
> > > Ohhh interesting, perhaps send as separate patch?
> >
> > sure, I will send it separately.
>
> Thanks!
>
> >
> > --
> > Regards
> > Yafang
>
> And overall - cheers for being an ABSOLUTE DELIGHT on review :) it's much
> appreciated. I shall buy you a beer (or whatever is your preferred
> beverage) at the next conference we are both at :)

Honestly, that's exactly what I wanted to say to you too! I learned so
much during your review process, and I owe you a beer (or your drink
of choice) as well!

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2025-09-03  2:11 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-26  7:19 [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 01/10] mm: thp: add support for " Yafang Shao
2025-08-27  2:57   ` kernel test robot
2025-08-27 11:39     ` Yafang Shao
2025-08-27 15:04       ` Lorenzo Stoakes
2025-08-27 15:03   ` Lorenzo Stoakes
2025-08-28  5:54     ` Yafang Shao
2025-08-28 10:50       ` Lorenzo Stoakes
2025-08-29  3:01         ` Yafang Shao
2025-08-29 10:42           ` Lorenzo Stoakes
2025-08-31  3:11             ` Yafang Shao
2025-09-01 11:39               ` Lorenzo Stoakes
2025-09-02  2:48                 ` Yafang Shao
2025-09-02  7:50                   ` Lorenzo Stoakes
2025-09-03  2:10                     ` Yafang Shao
2025-08-29  4:56   ` Barry Song
2025-08-29  5:36     ` Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 02/10] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
2025-08-27 15:34   ` Lorenzo Stoakes
2025-08-27 20:50     ` Shakeel Butt
2025-08-28 10:40       ` Lorenzo Stoakes
2025-08-28 16:00         ` Shakeel Butt
2025-08-29 10:45           ` Lorenzo Stoakes
2025-08-28  6:57     ` Yafang Shao
2025-08-28 10:42       ` Lorenzo Stoakes
2025-08-29  3:09         ` Yafang Shao
2025-08-27 20:45   ` Shakeel Butt
2025-08-28  6:58     ` Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 03/10] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
2025-08-27 15:42   ` Lorenzo Stoakes
2025-08-27 21:50     ` Andrii Nakryiko
2025-08-28  6:50       ` Yafang Shao
2025-08-28 10:51       ` Lorenzo Stoakes
2025-08-29  3:15         ` Yafang Shao
2025-08-29 10:42           ` Lorenzo Stoakes
2025-08-28  6:47     ` Yafang Shao
2025-08-29 10:43       ` Lorenzo Stoakes
2025-08-26  7:19 ` [PATCH v6 mm-new 04/10] bpf: mark vma->vm_mm as trusted Yafang Shao
2025-08-27 15:45   ` Lorenzo Stoakes
2025-08-28  6:12     ` Yafang Shao
2025-08-28 11:11       ` Lorenzo Stoakes
2025-08-29  3:05         ` Yafang Shao
2025-08-29 10:49           ` Lorenzo Stoakes
2025-08-31  3:16             ` Yafang Shao
2025-09-01 10:36               ` Lorenzo Stoakes
2025-08-26  7:19 ` [PATCH v6 mm-new 05/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 06/10] selftests/bpf: add test case for khugepaged fork Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 07/10] selftests/bpf: add test case to update thp policy Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 08/10] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 09/10] Documentation: add BPF-based THP adjustment documentation Yafang Shao
2025-08-26  7:19 ` [PATCH v6 mm-new 10/10] MAINTAINERS: add entry for BPF-based THP adjustment Yafang Shao
2025-08-27 15:47   ` Lorenzo Stoakes
2025-08-28  6:08     ` Yafang Shao
2025-08-26  7:42 ` [PATCH v6 mm-new 00/10] mm, bpf: BPF based THP order selection David Hildenbrand
2025-08-26  8:33   ` Lorenzo Stoakes
2025-08-26 12:06     ` Yafang Shao
2025-08-26  9:52   ` Usama Arif
2025-08-26 12:10     ` Yafang Shao
2025-08-26 12:03   ` Yafang Shao
2025-08-27 13:14 ` Lorenzo Stoakes
2025-08-28  2:58   ` Yafang Shao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).