linux-mm.kvack.org archive mirror
* [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection
@ 2025-08-18  5:55 Yafang Shao
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 1/5] mm: thp: add support for " Yafang Shao
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Yafang Shao @ 2025-08-18  5:55 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm, Yafang Shao

Background
----------

Our production servers consistently configure THP to "never" due to
historical incidents caused by its behavior. Key issues include:
- Increased Memory Consumption
  THP significantly raises overall memory usage, reducing available memory
  for workloads.

- Latency Spikes
  Random latency spikes occur due to frequent memory compaction triggered
  by THP.

- Lack of Fine-Grained Control
  THP tuning is globally configured, making it unsuitable for containerized
  environments. When multiple workloads share a host, enabling THP without
  per-workload control leads to unpredictable behavior.

Because of these issues, administrators will not switch to the madvise or
always modes unless per-workload THP control is available.

To address this, we propose a BPF-based THP policy for flexible, per-workload adjustment.
Additionally, as David mentioned [0], this mechanism can also serve as a
policy prototyping tool (test policies via BPF before upstreaming them).

Proposed Solution
-----------------

As suggested by David [0], we introduce a new BPF interface:

/**
 * @get_suggested_order: Get the suggested THP orders for allocation
 * @mm: mm_struct associated with the THP allocation
 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
 *                 When NULL, the decision should be based on @mm (i.e., when
 *                 triggered from an mm-scope hook rather than a VMA-specific
 *                 context).
 *                 Must belong to @mm (guaranteed by the caller).
 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
 * @orders: Bitmask of requested THP orders for this allocation
 *          - PMD-mapped allocation if PMD_ORDER is set
 *          - mTHP allocation otherwise
 *
 * Return: Bitmask of suggested THP orders for allocation. The highest
 *         suggested order will not exceed the highest requested order
 *         in @orders.
 */
 int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
			    u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;

This interface:
- Supports both use cases (per-workload tuning + policy prototyping).
- Can be extended with BPF helpers (e.g., for memory pressure awareness).
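
For illustration, a minimal struct_ops program implementing this hook could
look like the sketch below (fragment only, includes and license omitted; the
complete, working version is the selftest in Patch #5). It assumes the
bpf_thp_ops definition from Patch #1 and simply refuses THP in the page fault
path:

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggest_order, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
	     u64 vma_flags, enum tva_type tva_flags, int orders)
{
	/* Example policy: no THP at page fault time, leave the requested
	 * orders untouched for khugepaged and forced collapse.
	 */
	if (tva_flags == TVA_PAGEFAULT)
		return 0;
	return orders;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp = {
	.get_suggested_order = (void *)suggest_order,
};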

This is an experimental feature. To use it, you must enable
CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.

Warning:
- The interface may change
- Behavior may differ in future kernel versions
- We might remove it in the future

A simple test case is included in Patch #5.

Future work:
- Extend it to File THP

Changes:
RFC v4->v5:
- Add support for vma (David)
- Add mTHP support in khugepaged (Zi)
- Use bitmask of all allowed orders instead (Zi)
- Retrieve the page size and PMD order rather than hardcoding them (Zi)

RFC v3->v4: https://lwn.net/Articles/1031829/
- Use a new interface get_suggested_order() (David)
- Mark it as experimental (David, Lorenzo)
- Code improvement in THP (Usama)
- Code improvement in BPF struct ops (Amery)

RFC v2->v3: https://lwn.net/Articles/1024545/
- Finer-grained tuning based on madvise or always mode (David, Lorenzo)
- Use BPF to write more advanced policies logic (David, Lorenzo)

RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows,
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/
Yafang Shao (5):
  mm: thp: add support for BPF based THP order selection
  mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  mm: thp: add a new kfunc bpf_mm_get_task()
  bpf: mark vma->vm_mm as trusted
  selftest/bpf: add selftest for BPF based THP order selection

 include/linux/huge_mm.h                       |  15 +
 include/linux/khugepaged.h                    |  12 +-
 kernel/bpf/verifier.c                         |   5 +
 mm/Kconfig                                    |  12 +
 mm/Makefile                                   |   1 +
 mm/bpf_thp.c                                  | 269 ++++++++++++++++++
 mm/huge_memory.c                              |  10 +
 mm/khugepaged.c                               |  26 +-
 mm/memory.c                                   |  18 +-
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 224 +++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  76 +++++
 .../bpf/progs/test_thp_adjust_failure.c       |  25 ++
 13 files changed, 689 insertions(+), 7 deletions(-)
 create mode 100644 mm/bpf_thp.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c

-- 
2.47.3



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection
  2025-08-18  5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Yafang Shao
@ 2025-08-18  5:55 ` Yafang Shao
  2025-08-18 13:17   ` Usama Arif
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 2/5] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2025-08-18  5:55 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm, Yafang Shao

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, or other
  paths.
- System memory pressure
  (May require new BPF helpers to accurately assess memory pressure.)

Key Details:
- Only one BPF program can be attached at a time, but it can be updated
  dynamically to adjust the policy.
- Supports automatic mTHP order selection and per-workload THP policies.
- Only functional when THP is set to madvise or always.

It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to be enabled. [1]
This feature is unstable and may evolve in future kernel versions.

Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/huge_mm.h    |  15 +++
 include/linux/khugepaged.h |  12 ++-
 mm/Kconfig                 |  12 +++
 mm/Makefile                |   1 +
 mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c           |  10 ++
 mm/khugepaged.c            |  26 +++++-
 mm/memory.c                |  18 +++-
 8 files changed, 273 insertions(+), 7 deletions(-)
 create mode 100644 mm/bpf_thp.c

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1ac0d06fb3c1..f0c91d7bd267 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,8 @@
 
 #include <linux/fs.h> /* only for vma_is_dax() */
 #include <linux/kobject.h>
+#include <linux/pgtable.h>
+#include <linux/mm.h>
 
 vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
 };
 
 struct kobject;
@@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
 			(1<<TRANSPARENT_HUGEPAGE_FLAG);
 }
 
+#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
+int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+			u64 vma_flags, enum tva_type tva_flags, int orders);
+#else
+static inline int
+get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+		    u64 vma_flags, enum tva_type tva_flags, int orders)
+{
+	return orders;
+}
+#endif
+
 static inline int highest_order(unsigned long orders)
 {
 	return fls_long(orders) - 1;
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..d81c1228a21f 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,8 @@
 
 #include <linux/mm.h>
 
+#include <linux/huge_mm.h>
+
 extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern struct attribute_group khugepaged_attr_group;
@@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 
 static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
-	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
+	/*
+	 * THP allocation policy can be dynamically modified via BPF. Even if a
+	 * task was allowed to allocate THPs, BPF can decide whether its forked
+	 * child can allocate THPs.
+	 *
+	 * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
+	 */
+	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
+		get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
 		__khugepaged_enter(mm);
 }
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 4108bcd96784..d10089e3f181 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
 
 	  EXPERIMENTAL because the impact of some changes is still unclear.
 
+config EXPERIMENTAL_BPF_ORDER_SELECTION
+	bool "BPF-based THP order selection (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
+
+	help
+	  Enable dynamic THP order selection using BPF programs. This
+	  experimental feature allows custom BPF logic to determine optimal
+	  transparent hugepage allocation sizes at runtime.
+
+	  Warning: This feature is unstable and may change in future kernel
+	  versions.
+
 endif # TRANSPARENT_HUGEPAGE
 
 # simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index ef54aa615d9d..cb55d1509be1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
new file mode 100644
index 000000000000..2b03539452d1
--- /dev/null
+++ b/mm/bpf_thp.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+struct bpf_thp_ops {
+	/**
+	 * @get_suggested_order: Get the suggested THP orders for allocation
+	 * @mm: mm_struct associated with the THP allocation
+	 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
+	 *                 When NULL, the decision should be based on @mm (i.e., when
+	 *                 triggered from an mm-scope hook rather than a VMA-specific
+	 *                 context).
+	 *                 Must belong to @mm (guaranteed by the caller).
+	 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
+	 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
+	 * @orders: Bitmask of requested THP orders for this allocation
+	 *          - PMD-mapped allocation if PMD_ORDER is set
+	 *          - mTHP allocation otherwise
+	 *
+	 * Return: Bitmask of suggested THP orders for allocation. The highest
+	 *         suggested order will not exceed the highest requested order
+	 *         in @orders.
+	 */
+	int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+				   u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
+};
+
+static struct bpf_thp_ops bpf_thp;
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+			u64 vma_flags, enum tva_type tva_flags, int orders)
+{
+	int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+				   u64 vma_flags, enum tva_type tva_flags, int orders);
+	int suggested_orders = orders;
+
+	/* No BPF program is attached */
+	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+		      &transparent_hugepage_flags))
+		return suggested_orders;
+
+	rcu_read_lock();
+	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
+	if (!bpf_suggested_order)
+		goto out;
+
+	suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
+	if (highest_order(suggested_orders) > highest_order(orders))
+		suggested_orders = orders;
+
+out:
+	rcu_read_unlock();
+	return suggested_orders;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+	.get_func_proto = bpf_thp_get_func_proto,
+	.is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+			       const struct btf_member *member,
+			       void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	spin_lock(&thp_ops_lock);
+	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+		&transparent_hugepage_flags)) {
+		spin_unlock(&thp_ops_lock);
+		return -EBUSY;
+	}
+	WARN_ON_ONCE(bpf_thp.get_suggested_order);
+	WRITE_ONCE(bpf_thp.get_suggested_order, ops->get_suggested_order);
+	spin_unlock(&thp_ops_lock);
+	return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+	spin_lock(&thp_ops_lock);
+	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+	WARN_ON_ONCE(!bpf_thp.get_suggested_order);
+	rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
+	spin_unlock(&thp_ops_lock);
+
+	synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *ops = kdata;
+	struct bpf_thp_ops *old = old_kdata;
+	int ret = 0;
+
+	if (!ops || !old)
+		return -EINVAL;
+
+	spin_lock(&thp_ops_lock);
+	/* The prog has already been removed. */
+	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
+		ret = -ENOENT;
+		goto out;
+	}
+	WARN_ON_ONCE(!bpf_thp.get_suggested_order);
+	rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
+			    lockdep_is_held(&thp_ops_lock));
+
+out:
+	spin_unlock(&thp_ops_lock);
+	if (!ret)
+		synchronize_rcu();
+	return ret;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	if (!ops->get_suggested_order) {
+		pr_err("bpf_thp: required ops isn't implemented\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+			   u64 vma_flags, enum tva_type vm_flags, int orders)
+{
+	return orders;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+	.get_suggested_order = suggested_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+	.verifier_ops = &thp_bpf_verifier_ops,
+	.init = bpf_thp_init,
+	.init_member = bpf_thp_init_member,
+	.reg = bpf_thp_reg,
+	.unreg = bpf_thp_unreg,
+	.update = bpf_thp_update,
+	.validate = bpf_thp_validate,
+	.cfi_stubs = &__bpf_thp_ops,
+	.owner = THIS_MODULE,
+	.name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+
+	if (err)
+		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+	return err;
+}
+late_initcall(bpf_thp_ops_init);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d89992b65acc..bd8f8f34ab3c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 		return ret;
 	khugepaged_enter_vma(vma, vma->vm_flags);
 
+	/*
+	 * This check must occur after khugepaged_enter_vma() because:
+	 * 1. We may permit THP allocation via khugepaged
+	 * 2. While simultaneously disallowing THP allocation
+	 *    during page fault handling
+	 */
+	if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
+				BIT(PMD_ORDER))
+		return VM_FAULT_FALLBACK;
+
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
 			transparent_hugepage_use_zero_page()) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d3d4f116e14b..935583626db6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
+		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) &&
+		    get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED,
+					BIT(PMD_ORDER)))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
@@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 		return SCAN_ADDRESS_RANGE;
 	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
 		return SCAN_VMA_CHECK;
+	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER)))
+		return SCAN_VMA_CHECK;
 	/*
 	 * Anon VMA expected, the address may be unmapped then
 	 * remapped to file after khugepaged reaquired the mmap_lock.
@@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 		/* khugepaged_mm_lock actually not necessary for the below */
 		mm_slot_free(mm_slot_cache, mm_slot);
 		mmdrop(mm);
+	} else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) {
+		hash_del(&slot->hash);
+		list_del(&slot->mm_node);
+		mm_flags_clear(MMF_VM_HUGEPAGE, mm);
+		mm_slot_free(mm_slot_cache, mm_slot);
 	}
 }
 
@@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return SCAN_VMA_CHECK;
 
+	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
+				 BIT(PMD_ORDER)))
+		return SCAN_VMA_CHECK;
 	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
 	if (userfaultfd_wp(vma))
 		return SCAN_PTE_UFFD_WP;
@@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	 * the next mm on the list.
 	 */
 	vma = NULL;
+
+	/* If this mm is not suitable for the scan list, we should remove it. */
+	if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
+		goto breakouterloop_mmap_lock;
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
 
@@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 			progress++;
 			break;
 		}
-		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
+		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) ||
+		    !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED,
+					 BIT(PMD_ORDER))) {
 skip:
 			progress++;
 			continue;
@@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
 		return -EINVAL;
 
+	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
+				 BIT(PMD_ORDER)))
+		return -EINVAL;
+
 	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
 	if (!cc)
 		return -ENOMEM;
diff --git a/mm/memory.c b/mm/memory.c
index d9de6c056179..0178857aa058 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
 static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	int order, suggested_orders;
 	unsigned long orders;
 	struct folio *folio;
 	unsigned long addr;
@@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	spinlock_t *ptl;
 	pte_t *pte;
 	gfp_t gfp;
-	int order;
 
 	/*
 	 * If uffd is active for the vma we need per-page fault fidelity to
@@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	if (!zswap_never_enabled())
 		goto fallback;
 
+	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
+					       TVA_PAGEFAULT,
+					       BIT(PMD_ORDER) - 1);
+	if (!suggested_orders)
+		goto fallback;
 	entry = pte_to_swp_entry(vmf->orig_pte);
 	/*
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
-					  BIT(PMD_ORDER) - 1);
+					  suggested_orders);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
 					  vmf->address, orders);
@@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	int order, suggested_orders;
 	unsigned long orders;
 	struct folio *folio;
 	unsigned long addr;
 	pte_t *pte;
 	gfp_t gfp;
-	int order;
 
 	/*
 	 * If uffd is active for the vma we need per-page fault fidelity to
@@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
 	if (unlikely(userfaultfd_armed(vma)))
 		goto fallback;
 
+	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
+					       TVA_PAGEFAULT,
+					       BIT(PMD_ORDER) - 1);
+	if (!suggested_orders)
+		goto fallback;
 	/*
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * for this vma. Then filter out the orders that can't be allocated over
 	 * the faulting address and still be fully contained in the vma.
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
-					  BIT(PMD_ORDER) - 1);
+					  suggested_orders);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 
 	if (!orders)
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v5 mm-new 2/5] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
  2025-08-18  5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Yafang Shao
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 1/5] mm: thp: add support for " Yafang Shao
@ 2025-08-18  5:55 ` Yafang Shao
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 3/5] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2025-08-18  5:55 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm, Yafang Shao

We will utilize this new kfunc bpf_mm_get_mem_cgroup() to retrieve the
associated mem_cgroup from the given @mm. The obtained mem_cgroup must
be released by calling bpf_put_mem_cgroup() as a paired operation.
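
For example, a get_suggested_order() program can use this kfunc to key the THP
policy off the workload's cgroup. The sketch below is illustrative only (the
full version is in the selftest of Patch #5); cgrp_id is a hypothetical global
set from user space to the target cgroup's inode id:

int cgrp_id;	/* hypothetical: target cgroup inode id, set from user space */

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggest_order, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
	     u64 vma_flags, enum tva_type tva_flags, int orders)
{
	struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);
	int suggested = 0;

	if (!memcg)	/* always NULL when CONFIG_MEMCG is disabled */
		return 0;
	if (memcg->css.cgroup->kn->id == cgrp_id)	/* the target workload */
		suggested = orders;
	bpf_put_mem_cgroup(memcg);	/* must pair with the acquire above */
	return suggested;
}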

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/bpf_thp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
index 2b03539452d1..bdcf6f6af99b 100644
--- a/mm/bpf_thp.c
+++ b/mm/bpf_thp.c
@@ -175,10 +175,59 @@ static struct bpf_struct_ops bpf_bpf_thp_ops = {
 	.name = "bpf_thp_ops",
 };
 
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
+ * @mm: The mm_struct to query
+ *
+ * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
+ *
+ * Return: The associated mem_cgroup on success, or NULL on failure. Note that
+ * this function depends on CONFIG_MEMCG being enabled - it will always return
+ * NULL if CONFIG_MEMCG is not configured.
+ */
+__bpf_kfunc struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm)
+{
+	return get_mem_cgroup_from_mm(mm);
+}
+
+/**
+ * bpf_put_mem_cgroup - Release a memory cgroup obtained from bpf_mm_get_mem_cgroup()
+ * @memcg: The memory cgroup to release
+ */
+__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+	if (!memcg)
+		return;
+	css_put(&memcg->css);
+#endif
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_thp_ids)
+BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+BTF_KFUNCS_END(bpf_thp_ids)
+
+static const struct btf_kfunc_id_set bpf_thp_set = {
+	.owner = THIS_MODULE,
+	.set = &bpf_thp_ids,
+};
+
 static int __init bpf_thp_ops_init(void)
 {
-	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+	int err;
+
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_thp_set);
+	if (err) {
+		pr_err("bpf_thp: Failed to register kfunc sets (%d)\n", err);
+		return err;
+	}
 
+	err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
 	if (err)
 		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
 	return err;
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v5 mm-new 3/5] mm: thp: add a new kfunc bpf_mm_get_task()
  2025-08-18  5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Yafang Shao
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 1/5] mm: thp: add support for " Yafang Shao
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 2/5] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
@ 2025-08-18  5:55 ` Yafang Shao
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 4/5] bpf: mark vma->vm_mm as trusted Yafang Shao
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2025-08-18  5:55 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm, Yafang Shao

We will utilize this new kfunc bpf_mm_get_task() to retrieve the
associated task_struct from the given @mm. The obtained task_struct must
be released by calling bpf_task_release() as a paired operation.
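
A sketch of how a get_suggested_order() program might use it is shown below
(illustrative only; the acquired task must always be released with
bpf_task_release(), as the failure selftest in Patch #5 demonstrates):

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggest_order, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
	     u64 vma_flags, enum tva_type tva_flags, int orders)
{
	struct task_struct *p = bpf_mm_get_task(mm);

	if (!p)		/* NULL if CONFIG_MEMCG is disabled or the owner is gone */
		return orders;
	/* The policy may inspect task attributes such as p->comm or p->pid here. */
	bpf_task_release(p);	/* must pair with the acquire above */
	return orders;
}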

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
index bdcf6f6af99b..8ed1bf0d7f4d 100644
--- a/mm/bpf_thp.c
+++ b/mm/bpf_thp.c
@@ -205,11 +205,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
 #endif
 }
 
+/**
+ * bpf_mm_get_task - Get the task struct associated with a mm_struct.
+ * @mm: The mm_struct to query
+ *
+ * The obtained task_struct must be released by calling bpf_task_release().
+ *
+ * Return: The associated task_struct on success, or NULL on failure. Note that
+ * this function depends on CONFIG_MEMCG being enabled - it will always return
+ * NULL if CONFIG_MEMCG is not configured.
+ */
+__bpf_kfunc struct task_struct *bpf_mm_get_task(struct mm_struct *mm)
+{
+#ifdef CONFIG_MEMCG
+	struct task_struct *task;
+
+	if (!mm)
+		return NULL;
+	rcu_read_lock();
+	task = rcu_dereference(mm->owner);
+	if (!task)
+		goto out;
+	if (!refcount_inc_not_zero(&task->rcu_users))
+		goto out;
+
+	rcu_read_unlock();
+	return task;
+
+out:
+	rcu_read_unlock();
+#endif
+	return NULL;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_thp_ids)
 BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_mm_get_task, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
 BTF_KFUNCS_END(bpf_thp_ids)
 
 static const struct btf_kfunc_id_set bpf_thp_set = {
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v5 mm-new 4/5] bpf: mark vma->vm_mm as trusted
  2025-08-18  5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (2 preceding siblings ...)
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 3/5] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
@ 2025-08-18  5:55 ` Yafang Shao
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order selection Yafang Shao
  2025-08-18 14:35 ` [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Usama Arif
  5 siblings, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2025-08-18  5:55 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm, Yafang Shao

Every VMA must have an associated mm_struct, and it is safe to access it
outside of an RCU read-side critical section. Thus, we can mark it as trusted.
With this change, BPF helpers can safely access vma->vm_mm to retrieve the
associated task from the VMA.
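
A hypothetical fragment showing the effect (illustrative only; it assumes @vma
is already a trusted pointer in the program's context, e.g. a struct_ops or
tracing hook argument):

	/* vma->vm_mm is now trusted as well, so it can be passed to
	 * KF_TRUSTED_ARGS kfuncs such as bpf_mm_get_task() from the
	 * previous patch.
	 */
	struct task_struct *p = bpf_mm_get_task(vma->vm_mm);

	if (p) {
		/* ... inspect task attributes to drive the policy ... */
		bpf_task_release(p);
	}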

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/verifier.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c4f69a9e9af6..984ffbca5cbe 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7154,6 +7154,10 @@ BTF_TYPE_SAFE_TRUSTED(struct file) {
 	struct inode *f_inode;
 };
 
+BTF_TYPE_SAFE_TRUSTED(struct vm_area_struct) {
+	struct mm_struct *vm_mm;
+};
+
 BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry) {
 	struct inode *d_inode;
 };
@@ -7193,6 +7197,7 @@ static bool type_is_trusted(struct bpf_verifier_env *env,
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct bpf_iter__task));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct linux_binprm));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct file));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED(struct vm_area_struct));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id, "__safe_trusted");
 }
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order selection
  2025-08-18  5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (3 preceding siblings ...)
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 4/5] bpf: mark vma->vm_mm as trusted Yafang Shao
@ 2025-08-18  5:55 ` Yafang Shao
  2025-08-18 14:00   ` Usama Arif
  2025-08-18 14:35 ` [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Usama Arif
  5 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2025-08-18  5:55 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm, Yafang Shao

This selftest verifies that PMD-mapped THP allocation is disallowed in the
page fault path for tasks within a specific cgroup, while still permitting
THP allocation via khugepaged.

Since THP allocation depends on various factors (e.g., system memory
pressure), using the actual allocated THP size for validation is
unreliable. Instead, we check the return value of get_suggested_order(),
which indicates whether the system intends to allocate a THP, regardless of
whether the allocation ultimately succeeds.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 224 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  76 ++++++
 .../bpf/progs/test_thp_adjust_failure.c       |  25 ++
 4 files changed, 328 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 8916ab814a3e..27f0249c7600 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -26,6 +26,7 @@ CONFIG_DMABUF_HEAPS=y
 CONFIG_DMABUF_HEAPS_SYSTEM=y
 CONFIG_DUMMY=y
 CONFIG_DYNAMIC_FTRACE=y
+CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION=y
 CONFIG_FPROBE=y
 CONFIG_FTRACE_SYSCALLS=y
 CONFIG_FUNCTION_ERROR_INJECTION=y
@@ -51,6 +52,7 @@ CONFIG_IPV6_TUNNEL=y
 CONFIG_KEYS=y
 CONFIG_LIRC=y
 CONFIG_LWTUNNEL=y
+CONFIG_MEMCG=y
 CONFIG_MODULE_SIG=y
 CONFIG_MODULE_SRCVERSION_ALL=y
 CONFIG_MODULE_UNLOAD=y
@@ -114,6 +116,7 @@ CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
 CONFIG_SYN_COOKIES=y
 CONFIG_TEST_BPF=m
+CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_UDMABUF=y
 CONFIG_USERFAULTFD=y
 CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..959ea920b0ef
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <math.h>
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "test_thp_adjust.skel.h"
+#include "test_thp_adjust_failure.skel.h"
+
+#define LEN (16 * 1024 * 1024) /* 16MB */
+#define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
+#define PMD_SIZE_FILE "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
+
+static char *thp_addr;
+static char old_mode[32];
+
+static int thp_mode_save(void)
+{
+	const char *start, *end;
+	char buf[128];
+	int fd, err;
+	size_t len;
+
+	fd = open(THP_ENABLED_FILE, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	err = read(fd, buf, sizeof(buf) - 1);
+	if (err == -1)
+		goto close;
+
+	start = strchr(buf, '[');
+	end = start ? strchr(start, ']') : NULL;
+	if (!start || !end || end <= start) {
+		err = -1;
+		goto close;
+	}
+
+	len = end - start - 1;
+	if (len >= sizeof(old_mode))
+		len = sizeof(old_mode) - 1;
+	strncpy(old_mode, start + 1, len);
+	old_mode[len] = '\0';
+
+close:
+	close(fd);
+	return err;
+}
+
+static int thp_mode_set(const char *desired_mode)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_RDWR);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, desired_mode, strlen(desired_mode));
+	close(fd);
+	return err;
+}
+
+static int thp_mode_reset(void)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_WRONLY);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, old_mode, strlen(old_mode));
+	close(fd);
+	return err;
+}
+
+int thp_alloc(long pagesize)
+{
+	int err, i;
+
+	thp_addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (thp_addr == MAP_FAILED)
+		return -1;
+
+	err = madvise(thp_addr, LEN, MADV_HUGEPAGE);
+	if (err == -1)
+		goto unmap;
+
+	/* Accessing a single byte within a page is sufficient to trigger a page fault. */
+	for (i = 0; i < LEN; i += pagesize)
+		thp_addr[i] = 1;
+	return 0;
+
+unmap:
+	munmap(thp_addr, LEN);
+	return -1;
+}
+
+static void thp_free(void)
+{
+	if (!thp_addr)
+		return;
+	munmap(thp_addr, LEN);
+}
+
+static int get_pmd_order(long pagesize)
+{
+	ssize_t bytes_read, size;
+	char buf[64], *endptr;
+	int fd, ret = -1;
+
+	fd = open(PMD_SIZE_FILE, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	bytes_read = read(fd, buf, sizeof(buf) - 1);
+	if (bytes_read <= 0)
+		goto close_fd;
+
+	/* Remove potential newline character */
+	if (buf[bytes_read - 1] == '\n')
+		buf[bytes_read - 1] = '\0';
+
+	size = strtoul(buf, &endptr, 10);
+	if (endptr == buf || *endptr != '\0')
+		goto close_fd;
+	if (size % pagesize != 0)
+		goto close_fd;
+	ret = size / pagesize;
+	if ((ret & (ret - 1)) == 0)
+		ret = log2(ret);
+
+close_fd:
+	close(fd);
+	return ret;
+}
+
+static void subtest_thp_adjust(void)
+{
+	struct bpf_link *fentry_link, *ops_link;
+	int err, cgrp_fd, cgrp_id, pmd_order;
+	struct test_thp_adjust *skel;
+	long pagesize;
+
+	pagesize = sysconf(_SC_PAGESIZE);
+	pmd_order = get_pmd_order(pagesize);
+	if (!ASSERT_NEQ(pmd_order, -1, "get_pmd_order"))
+		return;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "cgrp_env_setup"))
+		return;
+
+	cgrp_fd = create_and_get_cgroup("thp_adjust");
+	if (!ASSERT_GE(cgrp_fd, 0, "create_and_get_cgroup"))
+		goto cleanup;
+
+	err = join_cgroup("thp_adjust");
+	if (!ASSERT_OK(err, "join_cgroup"))
+		goto close_fd;
+
+	cgrp_id = get_cgroup_id("thp_adjust");
+	if (!ASSERT_GE(cgrp_id, 0, "create_and_get_cgroup"))
+		goto join_root;
+
+	if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+		goto join_root;
+	if (!ASSERT_GE(thp_mode_set("madvise"), 0, "THP mode set"))
+		goto join_root;
+
+	skel = test_thp_adjust__open();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		goto thp_reset;
+
+	skel->bss->cgrp_id = cgrp_id;
+	skel->bss->pmd_order = pmd_order;
+
+	err = test_thp_adjust__load(skel);
+	if (!ASSERT_OK(err, "load"))
+		goto destroy;
+
+	fentry_link = bpf_program__attach_trace(skel->progs.thp_run);
+	if (!ASSERT_OK_PTR(fentry_link, "attach fentry"))
+		goto destroy;
+
+	ops_link = bpf_map__attach_struct_ops(skel->maps.thp);
+	if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+		goto destroy;
+
+	if (!ASSERT_NEQ(thp_alloc(pagesize), -1, "THP alloc"))
+		goto destroy;
+
+	/* After attaching struct_ops, THP will be allocated only in khugepaged. */
+	if (!ASSERT_EQ(skel->bss->pf_alloc, 0, "alloc_in_pf"))
+		goto thp_free;
+	if (!ASSERT_GT(skel->bss->pf_disallow, 0, "disallow_in_pf"))
+		goto thp_free;
+
+	if (!ASSERT_GT(skel->bss->khugepaged_alloc, 0, "alloc_in_khugepaged"))
+		goto thp_free;
+	ASSERT_EQ(skel->bss->khugepaged_disallow, 0, "disallow_in_khugepaged");
+
+thp_free:
+	thp_free();
+destroy:
+	test_thp_adjust__destroy(skel);
+thp_reset:
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+join_root:
+	/* We must join the root cgroup before removing the created cgroup. */
+	err = join_root_cgroup();
+	ASSERT_OK(err, "join_cgroup to root");
+close_fd:
+	close(cgrp_fd);
+	remove_cgroup("thp_adjust");
+cleanup:
+	cleanup_cgroup_environment();
+}
+
+void test_thp_adjust(void)
+{
+	if (test__start_subtest("thp_adjust"))
+		subtest_thp_adjust();
+	RUN_TESTS(test_thp_adjust_failure);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..97908ef29852
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define TVA_IN_PF (1 << 1)
+
+int pf_alloc, pf_disallow, khugepaged_alloc, khugepaged_disallow;
+struct mm_struct *target_mm;
+int pmd_order, cgrp_id;
+
+/* Detecting whether a task can successfully allocate THP is unreliable because
+ * it may be influenced by system memory pressure. Instead of making the result
+ * dependent on unpredictable factors, we should simply check
+ * get_suggested_order()'s return value, which is deterministic.
+ */
+SEC("fexit/get_suggested_order")
+int BPF_PROG(thp_run, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, u64 tva_flags, int orders, int retval)
+{
+	if (mm != target_mm)
+		return 0;
+
+	if (orders != (1 << pmd_order))
+		return 0;
+
+	if (tva_flags == TVA_PAGEFAULT) {
+		if (retval == (1 << pmd_order))
+			pf_alloc++;
+		else if (!retval)
+			pf_disallow++;
+	} else if (tva_flags == TVA_KHUGEPAGED || tva_flags == -1) {
+		if (retval == (1 << pmd_order))
+			khugepaged_alloc++;
+		else if (!retval)
+			khugepaged_disallow++;
+	}
+	return 0;
+}
+
+SEC("struct_ops/get_suggested_order")
+int BPF_PROG(bpf_suggested_order, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, enum tva_type tva_flags, int orders)
+{
+	struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);
+	int suggested_orders = 0;
+
+	/* Only works when CONFIG_MEMCG is enabled. */
+	if (!memcg)
+		return suggested_orders;
+
+	if (memcg->css.cgroup->kn->id == cgrp_id) {
+		if (!target_mm)
+			target_mm = mm;
+		/* BPF THP allocation policy:
+		 * - Allow PMD allocation in khugepaged only
+		 */
+		if ((tva_flags == TVA_KHUGEPAGED || tva_flags == -1) &&
+		    orders == (1 << pmd_order)) {
+			suggested_orders = orders;
+			goto out;
+		}
+	}
+
+out:
+	bpf_put_mem_cgroup(memcg);
+	return suggested_orders;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp = {
+	.get_suggested_order = (void *)bpf_suggested_order,
+};
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
new file mode 100644
index 000000000000..0742886eeddd
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/get_suggested_order")
+__failure __msg("Unreleased reference")
+int BPF_PROG(unreleased_task, struct mm_struct *mm, struct vm_area_struct *vma__nullable,
+	     u64 vma_flags, u64 tva_flags, int orders, int retval)
+{
+	struct task_struct *p = bpf_mm_get_task(mm);
+
+	/* The task should be released with bpf_task_release() */
+	return p ? 0 : 1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp = {
+	.get_suggested_order = (void *)unreleased_task,
+};
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 1/5] mm: thp: add support for " Yafang Shao
@ 2025-08-18 13:17   ` Usama Arif
  2025-08-19  3:08     ` Yafang Shao
  2025-08-19 11:10     ` Gutierrez Asier
  0 siblings, 2 replies; 17+ messages in thread
From: Usama Arif @ 2025-08-18 13:17 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm



On 18/08/2025 06:55, Yafang Shao wrote:
> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> programs to influence THP order selection based on factors such as:
> - Workload identity
>   For example, workloads running in specific containers or cgroups.
> - Allocation context
>   Whether the allocation occurs during a page fault, khugepaged, or other
>   paths.
> - System memory pressure
>   (May require new BPF helpers to accurately assess memory pressure.)
> 
> Key Details:
> - Only one BPF program can be attached at a time, but it can be updated
>   dynamically to adjust the policy.
> - Supports automatic mTHP order selection and per-workload THP policies.
> - Only functional when THP is set to madise or always.
> 
> It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> This feature is unstable and may evolve in future kernel versions.
> 
> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/huge_mm.h    |  15 +++
>  include/linux/khugepaged.h |  12 ++-
>  mm/Kconfig                 |  12 +++
>  mm/Makefile                |   1 +
>  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
>  mm/huge_memory.c           |  10 ++
>  mm/khugepaged.c            |  26 +++++-
>  mm/memory.c                |  18 +++-
>  8 files changed, 273 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 1ac0d06fb3c1..f0c91d7bd267 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,8 @@
>  
>  #include <linux/fs.h> /* only for vma_is_dax() */
>  #include <linux/kobject.h>
> +#include <linux/pgtable.h>
> +#include <linux/mm.h>
>  
>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
>  	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>  	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
>  };
>  
>  struct kobject;
> @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
>  			(1<<TRANSPARENT_HUGEPAGE_FLAG);
>  }
>  
> +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +			u64 vma_flags, enum tva_type tva_flags, int orders);
> +#else
> +static inline int
> +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +		    u64 vma_flags, enum tva_type tva_flags, int orders)
> +{
> +	return orders;
> +}
> +#endif
> +
>  static inline int highest_order(unsigned long orders)
>  {
>  	return fls_long(orders) - 1;
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index eb1946a70cff..d81c1228a21f 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -4,6 +4,8 @@
>  
>  #include <linux/mm.h>
>  
> +#include <linux/huge_mm.h>
> +
>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern struct attribute_group khugepaged_attr_group;
> @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  
>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> -	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> +	/*
> +	 * THP allocation policy can be dynamically modified via BPF. Even if a
> +	 * task was allowed to allocate THPs, BPF can decide whether its forked
> +	 * child can allocate THPs.
> +	 *
> +	 * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> +	 */
> +	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> +		get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))

Hi Yafang,

From the cover letter, one of the potential use cases you are trying to solve is the one where
the global policy is "never" but the workload wants THPs (either always or on a madvise basis).
But over here, MMF_VM_HUGEPAGE will never be set in that case, so mm_flags_test(MMF_VM_HUGEPAGE, oldmm)
will always evaluate to false and the get_suggested_order() call doesn't matter?



>  		__khugepaged_enter(mm);
>  }
>  
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 4108bcd96784..d10089e3f181 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
>  
>  	  EXPERIMENTAL because the impact of some changes is still unclear.
>  
> +config EXPERIMENTAL_BPF_ORDER_SELECTION
> +	bool "BPF-based THP order selection (EXPERIMENTAL)"
> +	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> +
> +	help
> +	  Enable dynamic THP order selection using BPF programs. This
> +	  experimental feature allows custom BPF logic to determine optimal
> +	  transparent hugepage allocation sizes at runtime.
> +
> +	  Warning: This feature is unstable and may change in future kernel
> +	  versions.
> +


I know there was a discussion on this earlier, but my opinion is that putting all of this
behind an experimental flag with warnings is not great. No one will be able to deploy this in
production if it is going to be removed, and I believe that is where the real usage is.

>  endif # TRANSPARENT_HUGEPAGE
>  
>  # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index ef54aa615d9d..cb55d1509be1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> new file mode 100644
> index 000000000000..2b03539452d1
> --- /dev/null
> +++ b/mm/bpf_thp.c
> @@ -0,0 +1,186 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +struct bpf_thp_ops {
> +	/**
> +	 * @get_suggested_order: Get the suggested THP orders for allocation
> +	 * @mm: mm_struct associated with the THP allocation
> +	 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> +	 *                 When NULL, the decision should be based on @mm (i.e., when
> +	 *                 triggered from an mm-scope hook rather than a VMA-specific
> +	 *                 context).
> +	 *                 Must belong to @mm (guaranteed by the caller).
> +	 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> +	 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> +	 * @orders: Bitmask of requested THP orders for this allocation
> +	 *          - PMD-mapped allocation if PMD_ORDER is set
> +	 *          - mTHP allocation otherwise
> +	 *
> +	 * Rerurn: Bitmask of suggested THP orders for allocation. The highest
> +	 *         suggested order will not exceed the highest requested order
> +	 *         in @orders.

If we want to make this generic enough that it doesn't change, should we allow the suggested
order to exceed the highest requested order?

> +	 */
> +	int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +				   u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
> +};
> +
> +static struct bpf_thp_ops bpf_thp;
> +static DEFINE_SPINLOCK(thp_ops_lock);
> +
> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +			u64 vma_flags, enum tva_type tva_flags, int orders)
> +{
> +	int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +				   u64 vma_flags, enum tva_type tva_flags, int orders);
> +	int suggested_orders = orders;
> +
> +	/* No BPF program is attached */
> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +		      &transparent_hugepage_flags))
> +		return suggested_orders;
> +
> +	rcu_read_lock();
> +	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> +	if (!bpf_suggested_order)
> +		goto out;


My rcu API knowledge is not the best, but maybe we could do:

if (!rcu_access_pointer(bpf_thp.get_suggested_order))
	return suggested_orders;

rcu_read_lock();
bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);

I believe this might be better as you don't acquire the rcu_read_lock and avoid
the lockdep checks when you do rcu_access_pointer, but I might be wrong
and hope you or someone on the mailing list corrects me if that's the case :)

> +
> +	suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
> +	if (highest_order(suggested_orders) > highest_order(orders))
> +		suggested_orders = orders;
> +

Maybe we should allow suggested_order to be greater than order if we want this to be truly generic?
Not a suggestion to change, just to have a discussion.

> +out:
> +	rcu_read_unlock();
> +	return suggested_orders;
> +}
> +
> +static bool bpf_thp_ops_is_valid_access(int off, int size,
> +					enum bpf_access_type type,
> +					const struct bpf_prog *prog,
> +					struct bpf_insn_access_aux *info)
> +{
> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return bpf_base_func_proto(func_id, prog);
> +}
> +
> +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> +	.get_func_proto = bpf_thp_get_func_proto,
> +	.is_valid_access = bpf_thp_ops_is_valid_access,
> +};
> +
> +static int bpf_thp_init(struct btf *btf)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_init_member(const struct btf_type *t,
> +			       const struct btf_member *member,
> +			       void *kdata, const void *udata)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +
> +	spin_lock(&thp_ops_lock);
> +	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +		&transparent_hugepage_flags)) {
> +		spin_unlock(&thp_ops_lock);
> +		return -EBUSY;
> +	}
> +	WARN_ON_ONCE(bpf_thp.get_suggested_order);

Should it be WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order)) ?

> +	WRITE_ONCE(bpf_thp.get_suggested_order, ops->get_suggested_order);

Should it be rcu_assign_pointer instead of WRITE_ONCE?

> +	spin_unlock(&thp_ops_lock);
> +	return 0;
> +}
> +
> +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> +{
> +	spin_lock(&thp_ops_lock);
> +	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> +	WARN_ON_ONCE(!bpf_thp.get_suggested_order);

Maybe we need to use rcu_access_pointer here in the warning?

> +	rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> +	spin_unlock(&thp_ops_lock);
> +
> +	synchronize_rcu();
> +}
> +
> +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +	struct bpf_thp_ops *old = old_kdata;
> +	int ret = 0;
> +
> +	if (!ops || !old)
> +		return -EINVAL;
> +
> +	spin_lock(&thp_ops_lock);
> +	/* The prog has aleady been removed. */
> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +	WARN_ON_ONCE(!bpf_thp.get_suggested_order);

As above about using rcu_access_pointer.

> +	rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
> +			    lockdep_is_held(&thp_ops_lock));
> +
> +out:
> +	spin_unlock(&thp_ops_lock);
> +	if (!ret)
> +		synchronize_rcu();
> +	return ret;
> +}
> +
> +static int bpf_thp_validate(void *kdata)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +
> +	if (!ops->get_suggested_order) {
> +		pr_err("bpf_thp: required ops isn't implemented\n");
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> +			   u64 vma_flags, enum tva_type vm_flags, int orders)
> +{
> +	return orders;
> +}
> +
> +static struct bpf_thp_ops __bpf_thp_ops = {
> +	.get_suggested_order = suggested_order,
> +};
> +
> +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> +	.verifier_ops = &thp_bpf_verifier_ops,
> +	.init = bpf_thp_init,
> +	.init_member = bpf_thp_init_member,
> +	.reg = bpf_thp_reg,
> +	.unreg = bpf_thp_unreg,
> +	.update = bpf_thp_update,
> +	.validate = bpf_thp_validate,
> +	.cfi_stubs = &__bpf_thp_ops,
> +	.owner = THIS_MODULE,
> +	.name = "bpf_thp_ops",
> +};
> +
> +static int __init bpf_thp_ops_init(void)
> +{
> +	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> +
> +	if (err)
> +		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> +	return err;
> +}
> +late_initcall(bpf_thp_ops_init);
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d89992b65acc..bd8f8f34ab3c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  		return ret;
>  	khugepaged_enter_vma(vma, vma->vm_flags);
>  
> +	/*
> +	 * This check must occur after khugepaged_enter_vma() because:
> +	 * 1. We may permit THP allocation via khugepaged
> +	 * 2. While simultaneously disallowing THP allocation
> +	 *    during page fault handling
> +	 */
> +	if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
> +				BIT(PMD_ORDER))
> +		return VM_FAULT_FALLBACK;
> +
>  	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>  			!mm_forbids_zeropage(vma->vm_mm) &&
>  			transparent_hugepage_use_zero_page()) {
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d3d4f116e14b..935583626db6 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>  {
>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> +		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) &&
> +		    get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED,
> +					BIT(PMD_ORDER)))
>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }
> @@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>  		return SCAN_ADDRESS_RANGE;
>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
>  		return SCAN_VMA_CHECK;
> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER)))
> +		return SCAN_VMA_CHECK;
>  	/*
>  	 * Anon VMA expected, the address may be unmapped then
>  	 * remapped to file after khugepaged reaquired the mmap_lock.
> @@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>  		/* khugepaged_mm_lock actually not necessary for the below */
>  		mm_slot_free(mm_slot_cache, mm_slot);
>  		mmdrop(mm);
> +	} else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) {
> +		hash_del(&slot->hash);
> +		list_del(&slot->mm_node);
> +		mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> +		mm_slot_free(mm_slot_cache, mm_slot);
>  	}
>  }
>  
> @@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
>  		return SCAN_VMA_CHECK;
>  
> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> +				 BIT(PMD_ORDER)))
> +		return SCAN_VMA_CHECK;
>  	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
>  	if (userfaultfd_wp(vma))
>  		return SCAN_PTE_UFFD_WP;
> @@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  	 * the next mm on the list.
>  	 */
>  	vma = NULL;
> +
> +	/* If this mm is not suitable for the scan list, we should remove it. */
> +	if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> +		goto breakouterloop_mmap_lock;
>  	if (unlikely(!mmap_read_trylock(mm)))
>  		goto breakouterloop_mmap_lock;
>  
> @@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  			progress++;
>  			break;
>  		}
> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
> +		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) ||
> +		    !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED,
> +					 BIT(PMD_ORDER))) {
>  skip:
>  			progress++;
>  			continue;
> @@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
>  		return -EINVAL;
>  
> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
> +				 BIT(PMD_ORDER)))
> +		return -EINVAL;
> +
>  	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
>  	if (!cc)
>  		return -ENOMEM;
> diff --git a/mm/memory.c b/mm/memory.c
> index d9de6c056179..0178857aa058 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
>  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> +	int order, suggested_orders;
>  	unsigned long orders;
>  	struct folio *folio;
>  	unsigned long addr;
> @@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  	spinlock_t *ptl;
>  	pte_t *pte;
>  	gfp_t gfp;
> -	int order;
>  
>  	/*
>  	 * If uffd is active for the vma we need per-page fault fidelity to
> @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  	if (!zswap_never_enabled())
>  		goto fallback;
>  
> +	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> +					       TVA_PAGEFAULT,
> +					       BIT(PMD_ORDER) - 1);
> +	if (!suggested_orders)
> +		goto fallback;
>  	entry = pte_to_swp_entry(vmf->orig_pte);
>  	/*
>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>  	 * and suitable for swapping THP.
>  	 */
>  	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> -					  BIT(PMD_ORDER) - 1);
> +					  suggested_orders);
>  	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>  	orders = thp_swap_suitable_orders(swp_offset(entry),
>  					  vmf->address, orders);
> @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	int order, suggested_orders;
>  	unsigned long orders;
>  	struct folio *folio;
>  	unsigned long addr;
>  	pte_t *pte;
>  	gfp_t gfp;
> -	int order;
>  
>  	/*
>  	 * If uffd is active for the vma we need per-page fault fidelity to
> @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  	if (unlikely(userfaultfd_armed(vma)))
>  		goto fallback;
>  
> +	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
> +					       TVA_PAGEFAULT,
> +					       BIT(PMD_ORDER) - 1);
> +	if (!suggested_orders)
> +		goto fallback;
>  	/*
>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>  	 * for this vma. Then filter out the orders that can't be allocated over
>  	 * the faulting address and still be fully contained in the vma.
>  	 */
>  	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> -					  BIT(PMD_ORDER) - 1);
> +					  suggested_orders);
>  	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>  
>  	if (!orders)




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order seletection
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order seletection Yafang Shao
@ 2025-08-18 14:00   ` Usama Arif
  2025-08-19  3:09     ` Yafang Shao
  0 siblings, 1 reply; 17+ messages in thread
From: Usama Arif @ 2025-08-18 14:00 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm



On 18/08/2025 06:55, Yafang Shao wrote:
> This self-test verifies that PMD-mapped THP allocation is restricted in
> page faults for tasks within a specific cgroup, while still permitting
> THP allocation via khugepaged.
> 
> Since THP allocation depends on various factors (e.g., system memory
> pressure), using the actual allocated THP size for validation is
> unreliable. Instead, we check the return value of get_suggested_order(),
> which indicates whether the system intends to allocate a THP, regardless of
> whether the allocation ultimately succeeds.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  tools/testing/selftests/bpf/config            |   3 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 224 ++++++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c     |  76 ++++++
>  .../bpf/progs/test_thp_adjust_failure.c       |  25 ++
>  4 files changed, 328 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
> 

I think it would be good to add selftests to make sure the bpf programs are working
after fork/exec as intended.




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection
  2025-08-18  5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (4 preceding siblings ...)
  2025-08-18  5:55 ` [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order seletection Yafang Shao
@ 2025-08-18 14:35 ` Usama Arif
  2025-08-19  2:41   ` Yafang Shao
  5 siblings, 1 reply; 17+ messages in thread
From: Usama Arif @ 2025-08-18 14:35 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm



On 18/08/2025 06:55, Yafang Shao wrote:
> Background
> ----------
> 
> Our production servers consistently configure THP to "never" due to
> historical incidents caused by its behavior. Key issues include:
> - Increased Memory Consumption
>   THP significantly raises overall memory usage, reducing available memory
>   for workloads.
> 
> - Latency Spikes
>   Random latency spikes occur due to frequent memory compaction triggered
>   by THP.
> 
> - Lack of Fine-Grained Control
>   THP tuning is globally configured, making it unsuitable for containerized
>   environments. When multiple workloads share a host, enabling THP without
>   per-workload control leads to unpredictable behavior.
> 
> Due to these issues, administrators avoid switching to madvise or always
> modes—unless per-workload THP control is implemented.
> 
> To address this, we propose BPF-based THP policy for flexible adjustment.
> Additionally, as David mentioned [0], this mechanism can also serve as a
> policy prototyping tool (test policies via BPF before upstreaming them).

Hi Yafang,

A few points:

The link [0] is mentioned a couple of times in the coverletter, but it doesn't seem
to be anywhere in the coverletter.

I am probably missing something over here, but the current version won't accomplish
the usecase you have described at the start of the coverletter and are aiming for, right?
i.e. THP global policy "never", but get hugepages on an madvise or always basis.
I think there was a new THP mode introduced in some earlier revision where you can switch to it
from "never" and then you can use bpf programs with it, but its not in this revision?
It might be useful to add your specific usecase as a selftest.

Do we have some numbers on what the overhead of calling the bpf program is in the
pagefault path as its a critical path?

I remember there was a discussion on this in the earlier revisions, and I have mentioned this in patch 1
as well, but I think making this feature experimental with warnings might not be a great idea.
It could lead to 2 paths:
- people don't deploy this in their fleet because its marked as experimental and they dont want
their machines to break once they upgrade the kernel and this is changed. We will have a difficult
time improving upon this as this is just going to be used for prototyping and won't be driven by
production data.
- people are careless and deploy it in on their production machines, and you get reports that this
has broken after kernel upgrades (despite being marked as experimental :)).
This is just my opinion (which can be wrong :)), but I think we should try and have this merged
as a stable interface that won't change. There might be bugs reported down the line, but I am hoping
we can get the interface of get_suggested_order right in the first implementation that gets merged? 


Thanks!
Usama
> 
> Proposed Solution
> -----------------
> 
> As suggested by David [0], we introduce a new BPF interface:
> 
> /**
>  * @get_suggested_order: Get the suggested THP orders for allocation
>  * @mm: mm_struct associated with the THP allocation
>  * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>  *                 When NULL, the decision should be based on @mm (i.e., when
>  *                 triggered from an mm-scope hook rather than a VMA-specific
>  *                 context).
>  *                 Must belong to @mm (guaranteed by the caller).
>  * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>  * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>  * @orders: Bitmask of requested THP orders for this allocation
>  *          - PMD-mapped allocation if PMD_ORDER is set
>  *          - mTHP allocation otherwise
>  *
>  * Return: Bitmask of suggested THP orders for allocation. The highest
>  *         suggested order will not exceed the highest requested order
>  *         in @orders.
>  */
>  int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> 			    u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
> 
> This interface:
> - Supports both use cases (per-workload tuning + policy prototyping).
> - Can be extended with BPF helpers (e.g., for memory pressure awareness).
> 
> This is an experimental feature. To use it, you must enable
> CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.
> 
> Warning:
> - The interface may change
> - Behavior may differ in future kernel versions
> - We might remove it in the future
> 
> A simple test case is included in Patch #4.
> 
> Future work:
> - Extend it to File THP
> 
> Changes:
> RFC v4->v5:
> - Add support for vma (David)
> - Add mTHP support in khugepaged (Zi)
> - Use bitmask of all allowed orders instead (Zi)
> - Retrieve the page size and PMD order rather than hardcoding them (Zi)
> 
> RFC v3->v4: https://lwn.net/Articles/1031829/
> - Use a new interface get_suggested_order() (David)
> - Mark it as experimental (David, Lorenzo)
> - Code improvement in THP (Usama)
> - Code improvement in BPF struct ops (Amery)
> 
> RFC v2->v3: https://lwn.net/Articles/1024545/
> - Finer-graind tuning based on madvise or always mode (David, Lorenzo)
> - Use BPF to write more advanced policies logic (David, Lorenzo)
> 
> RFC v1->v2: https://lwn.net/Articles/1021783/
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
> 
> RFC v1: https://lwn.net/Articles/1019290/
> Yafang Shao (5):
>   mm: thp: add support for BPF based THP order selection
>   mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
>   mm: thp: add a new kfunc bpf_mm_get_task()
>   bpf: mark vma->vm_mm as trusted
>   selftest/bpf: add selftest for BPF based THP order seletection
> 
>  include/linux/huge_mm.h                       |  15 +
>  include/linux/khugepaged.h                    |  12 +-
>  kernel/bpf/verifier.c                         |   5 +
>  mm/Kconfig                                    |  12 +
>  mm/Makefile                                   |   1 +
>  mm/bpf_thp.c                                  | 269 ++++++++++++++++++
>  mm/huge_memory.c                              |  10 +
>  mm/khugepaged.c                               |  26 +-
>  mm/memory.c                                   |  18 +-
>  tools/testing/selftests/bpf/config            |   3 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 224 +++++++++++++++
>  .../selftests/bpf/progs/test_thp_adjust.c     |  76 +++++
>  .../bpf/progs/test_thp_adjust_failure.c       |  25 ++
>  13 files changed, 689 insertions(+), 7 deletions(-)
>  create mode 100644 mm/bpf_thp.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection
  2025-08-18 14:35 ` [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Usama Arif
@ 2025-08-19  2:41   ` Yafang Shao
  2025-08-19 10:44     ` Usama Arif
  0 siblings, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2025-08-19  2:41 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, ameryhung, rientjes, bpf, linux-mm

On Mon, Aug 18, 2025 at 10:35 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 18/08/2025 06:55, Yafang Shao wrote:
> > Background
> > ----------
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> >   THP significantly raises overall memory usage, reducing available memory
> >   for workloads.
> >
> > - Latency Spikes
> >   Random latency spikes occur due to frequent memory compaction triggered
> >   by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning is globally configured, making it unsuitable for containerized
> >   environments. When multiple workloads share a host, enabling THP without
> >   per-workload control leads to unpredictable behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or always
> > modes—unless per-workload THP control is implemented.
> >
> > To address this, we propose BPF-based THP policy for flexible adjustment.
> > Additionally, as David mentioned [0], this mechanism can also serve as a
> > policy prototyping tool (test policies via BPF before upstreaming them).
>
> Hi Yafang,
>
> A few points:
>
> The link [0] is mentioned a couple of times in the coverletter, but it doesn't seem
> to be anywhere in the coverletter.

Oops, my bad.

>
> I am probably missing something over here, but the current version won't accomplish
> the usecase you have described at the start of the coverletter and are aiming for, right?
> i.e. THP global policy "never", but get hugepages on an madvise or always basis.

In "never" mode, THP allocation is entirely disabled (except via
MADV_COLLAPSE). However, we can achieve the same behavior—and
more—using a BPF program, even in "madvise" or "always" mode. Instead
of introducing a new THP mode, we dynamically enforce policy via BPF.

Deployment Steps in our production servers:

1. Initial Setup:
- Set THP mode to "never" (disabling THP by default).
- Attach the BPF program and pin the BPF maps and links.
- Pinning ensures persistence (like a kernel module), preventing
disruption under system pressure.
- A THP whitelist map tracks allowed cgroups (initially empty → no THP
allocations).

2. Enable THP Control:
- Switch THP mode to "always" or "madvise" (BPF now governs actual allocations).

3. Dynamic Management:
- To permit THP for a cgroup, add its ID to the whitelist map.
- To revoke permission, remove the cgroup ID from the map.
- The BPF program can be updated live (policy adjustments require no
task interruption).
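
To make this concrete, a minimal sketch of what such a whitelist program could
look like (illustrative only: the map name, the use of
bpf_get_current_cgroup_id() and the SEC()/struct_ops wiring are assumptions of
this example, it relies on a vmlinux.h built from a kernel with patch 1/5
applied, and it keys on the faulting task's cgroup, so it ignores the
mm-scope/khugepaged invocations where current is not the target task):

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* cgroup ids allowed to get THP; updated from user space */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);
	__type(value, __u32);
} thp_whitelist SEC(".maps");

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggested_order, struct mm_struct *mm,
	     struct vm_area_struct *vma, u64 vma_flags,
	     enum tva_type tva_flags, int orders)
{
	__u64 cgid = bpf_get_current_cgroup_id();

	/* allow the requested orders only for whitelisted cgroups */
	if (bpf_map_lookup_elem(&thp_whitelist, &cgid))
		return orders;
	return 0;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_ops = {
	.get_suggested_order = (void *)suggested_order,
};

User space then pins the link and adds or removes cgroup ids in thp_whitelist
(e.g. with bpf_map_update_elem()/bpf_map_delete_elem() or bpftool map update)
as workloads are allowed or revoked.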

> I think there was a new THP mode introduced in some earlier revision where you can switch to it
> from "never" and then you can use bpf programs with it, but its not in this revision?
> It might be useful to add your specific usecase as a selftest.
>
> Do we have some numbers on what the overhead of calling the bpf program is in the
> pagefault path as its a critical path?

In our current implementation, THP allocation occurs during the page
fault path. As such, I have not yet evaluated performance for this
specific case.
The overhead is expected to be workload-dependent, primarily influenced by:
- Memory availability: The presence (or absence) of higher-order free pages
- System pressure: Contention for memory compaction, NUMA balancing,
or direct reclaim
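
For a rough upper bound on the cost of the hook itself, independent of
compaction/reclaim effects, a simple fault-walk micro-benchmark run with and
without the program attached should already be telling. A minimal sketch in
plain C (nothing here is specific to this series):

// fault_bench.c: time first-touch anonymous page faults
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	const size_t len = 1UL << 30;	/* 1 GiB */
	struct timespec t0, t1;
	unsigned char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* one write per base page; fewer faults if THP is granted */
	for (size_t off = 0; off < len; off += 4096)
		p[off] = 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("fault walk: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	munmap(p, len);
	return 0;
}

Comparing the same run with THP genuinely disabled, with the program attached
and returning 0, and with it returning the requested orders would separate the
cost of the call itself from the cost of the THP work it enables.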

>
> I remember there was a discussion on this in the earlier revisions, and I have mentioned this in patch 1
> as well, but I think making this feature experimental with warnings might not be a great idea.

The experimental status of this feature was requested by David and
Lorenzo, who likely have specific technical considerations behind this
requirement.

> It could lead to 2 paths:
> - people don't deploy this in their fleet because its marked as experimental and they dont want
> their machines to break once they upgrade the kernel and this is changed. We will have a difficult
> time improving upon this as this is just going to be used for prototyping and won't be driven by
> production data.
> - people are careless and deploy it in on their production machines, and you get reports that this
> has broken after kernel upgrades (despite being marked as experimental :)).
> This is just my opinion (which can be wrong :)), but I think we should try and have this merged
> as a stable interface that won't change. There might be bugs reported down the line, but I am hoping
> we can get the interface of get_suggested_order right in the first implementation that gets merged?

We may eventually remove the experimental status or deprecate this
feature entirely, depending on its adoption. However, the first
critical step is to make it available for broader usage and
evaluation.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection
  2025-08-18 13:17   ` Usama Arif
@ 2025-08-19  3:08     ` Yafang Shao
  2025-08-19 10:11       ` Usama Arif
  2025-08-19 11:10     ` Gutierrez Asier
  1 sibling, 1 reply; 17+ messages in thread
From: Yafang Shao @ 2025-08-19  3:08 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, ameryhung, rientjes, bpf, linux-mm

> Hi Yafang,
>
> From the coverletter, one of the potential usecases you are trying to solve for is if global policy
> is "never", but the workload want THPs (either always or on madvise basis). But over here,
> MMF_VM_HUGEPAGE will never be set so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
> always evaluate to false and the get_suggested_order call doesn't matter?

See the reply in another thread.

>
>
>
> >               __khugepaged_enter(mm);
> >  }
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 4108bcd96784..d10089e3f181 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
> >
> >         EXPERIMENTAL because the impact of some changes is still unclear.
> >
> > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > +
> > +     help
> > +       Enable dynamic THP order selection using BPF programs. This
> > +       experimental feature allows custom BPF logic to determine optimal
> > +       transparent hugepage allocation sizes at runtime.
> > +
> > +       Warning: This feature is unstable and may change in future kernel
> > +       versions.
> > +
>
>
> I know there was a discussion on this earlier, but my opinion is that putting all of this
> as experiment with warnings is not great. No one will be able to deploy this in production
> if its going to be removed, and I believe thats where the real usage is.

See the reply in another thread.

>
> >  endif # TRANSPARENT_HUGEPAGE
> >
> >  # simple helper to make the code a bit easier to read
> > diff --git a/mm/Makefile b/mm/Makefile
> > index ef54aa615d9d..cb55d1509be1 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> >  obj-$(CONFIG_NUMA) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > new file mode 100644
> > index 000000000000..2b03539452d1
> > --- /dev/null
> > +++ b/mm/bpf_thp.c
> > @@ -0,0 +1,186 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +struct bpf_thp_ops {
> > +     /**
> > +      * @get_suggested_order: Get the suggested THP orders for allocation
> > +      * @mm: mm_struct associated with the THP allocation
> > +      * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
> > +      *                 When NULL, the decision should be based on @mm (i.e., when
> > +      *                 triggered from an mm-scope hook rather than a VMA-specific
> > +      *                 context).
> > +      *                 Must belong to @mm (guaranteed by the caller).
> > +      * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
> > +      * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
> > +      * @orders: Bitmask of requested THP orders for this allocation
> > +      *          - PMD-mapped allocation if PMD_ORDER is set
> > +      *          - mTHP allocation otherwise
> > +      *
> > +      * Return: Bitmask of suggested THP orders for allocation. The highest
> > +      *         suggested order will not exceed the highest requested order
> > +      *         in @orders.
>
> If we want to make this generic enough so that it doesnt change, should we allow suggested order to
> exceed highest requested order?

The maximum requested order is determined by the callsite. For example:
- PMD-mapped THP requests PMD_ORDER
- the mTHP paths request orders up to PMD_ORDER - 1 (i.e. the bitmask BIT(PMD_ORDER) - 1)

We must respect this upper bound to avoid undefined behavior. That is why the
hook clamps the result: if a program returned BIT(PMD_ORDER) for an mTHP
request, highest_order() of the suggestion would exceed that of the request
and we fall back to the original @orders, so a program can only narrow the set
of orders, never widen it.

>
> > +      */
> > +     int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                                u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
> > +};
> > +
> > +static struct bpf_thp_ops bpf_thp;
> > +static DEFINE_SPINLOCK(thp_ops_lock);
> > +
> > +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                     u64 vma_flags, enum tva_type tva_flags, int orders)
> > +{
> > +     int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> > +                                u64 vma_flags, enum tva_type tva_flags, int orders);
> > +     int suggested_orders = orders;
> > +
> > +     /* No BPF program is attached */
> > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                   &transparent_hugepage_flags))
> > +             return suggested_orders;
> > +
> > +     rcu_read_lock();
> > +     bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> > +     if (!bpf_suggested_order)
> > +             goto out;
>
>
> My rcu API knowledge is not the best, but maybe we could do:
>
> if (!rcu_access_pointer(bpf_thp.get_suggested_order))
>         return suggested_orders;
>

There might be a race here.  The current rcu_access_pointer() check
occurs outside the RCU read-side critical section, meaning the
protected pointer could be freed between the check and use.
Therefore, we must perform the NULL check within the RCU read critical
section when dereferencing the pointer:

	rcu_read_lock();
	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
	if (!bpf_suggested_order)
		goto out;
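
For completeness, rcu_access_pointer() by itself is fine outside a read-side
section as long as the value is only tested against NULL and never
dereferenced, so a lockless early return could in principle coexist with the
re-check under rcu_read_lock(); in this function the test_bit() check above
already plays that fast-path role, which is why only the in-lock NULL check is
strictly required. A purely illustrative combination:

	if (!rcu_access_pointer(bpf_thp.get_suggested_order))
		return suggested_orders;

	rcu_read_lock();
	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
	if (!bpf_suggested_order)
		goto out;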

>
> I believe this might be better as you dont acquire the rcu_read_lock and avoid
> the lockdep checks when you do rcu_access_pointer, but I might be wrong
> and hope you or someone on the mailing list corrects me if thats the case :)
>
> > +
> > +     suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
> > +     if (highest_order(suggested_orders) > highest_order(orders))
> > +             suggested_orders = orders;
> > +
>
> Maybe we should allow suggested_order to be greater than order if we want this to be truly generic?
> Not a suggestion to change, just to have a discussion.

As replied above, the upper bound is a limitation.

>
> > +out:
> > +     rcu_read_unlock();
> > +     return suggested_orders;
> > +}
> > +
> > +static bool bpf_thp_ops_is_valid_access(int off, int size,
> > +                                     enum bpf_access_type type,
> > +                                     const struct bpf_prog *prog,
> > +                                     struct bpf_insn_access_aux *info)
> > +{
> > +     return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> > +}
> > +
> > +static const struct bpf_func_proto *
> > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +     return bpf_base_func_proto(func_id, prog);
> > +}
> > +
> > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> > +     .get_func_proto = bpf_thp_get_func_proto,
> > +     .is_valid_access = bpf_thp_ops_is_valid_access,
> > +};
> > +
> > +static int bpf_thp_init(struct btf *btf)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int bpf_thp_init_member(const struct btf_type *t,
> > +                            const struct btf_member *member,
> > +                            void *kdata, const void *udata)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> > +{
> > +     struct bpf_thp_ops *ops = kdata;
> > +
> > +     spin_lock(&thp_ops_lock);
> > +     if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +             &transparent_hugepage_flags)) {
> > +             spin_unlock(&thp_ops_lock);
> > +             return -EBUSY;
> > +     }
> > +     WARN_ON_ONCE(bpf_thp.get_suggested_order);
>
> Should it be WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order)) ?

Nice catch. I'll make that change.

>
> > +     WRITE_ONCE(bpf_thp.get_suggested_order, ops->get_suggested_order);
>
> Should it be rcu_assign_pointer instead of WRITE_ONCE?

will change it.

>
> > +     spin_unlock(&thp_ops_lock);
> > +     return 0;
> > +}
> > +
> > +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> > +{
> > +     spin_lock(&thp_ops_lock);
> > +     clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> > +     WARN_ON_ONCE(!bpf_thp.get_suggested_order);
>
> Maybe we need to use rcu_access_pointer here in the warning?

Agreed, will change it.

>
> > +     rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
> > +     spin_unlock(&thp_ops_lock);
> > +
> > +     synchronize_rcu();
> > +}
> > +
> > +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> > +{
> > +     struct bpf_thp_ops *ops = kdata;
> > +     struct bpf_thp_ops *old = old_kdata;
> > +     int ret = 0;
> > +
> > +     if (!ops || !old)
> > +             return -EINVAL;
> > +
> > +     spin_lock(&thp_ops_lock);
> > +     /* The prog has already been removed. */
> > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
> > +             ret = -ENOENT;
> > +             goto out;
> > +     }
> > +     WARN_ON_ONCE(!bpf_thp.get_suggested_order);
>
> As above about using rcu_access_pointer.

will change it.
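
For reference, with those changes folded in, the registration and
unregistration side would look roughly like this (a sketch of the agreed
direction, not the final patch):

static int bpf_thp_reg(void *kdata, struct bpf_link *link)
{
	struct bpf_thp_ops *ops = kdata;

	spin_lock(&thp_ops_lock);
	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
			     &transparent_hugepage_flags)) {
		spin_unlock(&thp_ops_lock);
		return -EBUSY;
	}
	/* rcu_access_pointer() is sufficient for a plain NULL check */
	WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
	rcu_assign_pointer(bpf_thp.get_suggested_order,
			   ops->get_suggested_order);
	spin_unlock(&thp_ops_lock);
	return 0;
}

static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
{
	spin_lock(&thp_ops_lock);
	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
		  &transparent_hugepage_flags);
	WARN_ON_ONCE(!rcu_access_pointer(bpf_thp.get_suggested_order));
	rcu_replace_pointer(bpf_thp.get_suggested_order, NULL,
			    lockdep_is_held(&thp_ops_lock));
	spin_unlock(&thp_ops_lock);

	synchronize_rcu();
}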

Thanks for your review!

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order seletection
  2025-08-18 14:00   ` Usama Arif
@ 2025-08-19  3:09     ` Yafang Shao
  0 siblings, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2025-08-19  3:09 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, ameryhung, rientjes, bpf, linux-mm

On Mon, Aug 18, 2025 at 10:00 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 18/08/2025 06:55, Yafang Shao wrote:
> > This self-test verifies that PMD-mapped THP allocation is restricted in
> > page faults for tasks within a specific cgroup, while still permitting
> > THP allocation via khugepaged.
> >
> > Since THP allocation depends on various factors (e.g., system memory
> > pressure), using the actual allocated THP size for validation is
> > unreliable. Instead, we check the return value of get_suggested_order(),
> > which indicates whether the system intends to allocate a THP, regardless of
> > whether the allocation ultimately succeeds.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  tools/testing/selftests/bpf/config            |   3 +
> >  .../selftests/bpf/prog_tests/thp_adjust.c     | 224 ++++++++++++++++++
> >  .../selftests/bpf/progs/test_thp_adjust.c     |  76 ++++++
> >  .../bpf/progs/test_thp_adjust_failure.c       |  25 ++
> >  4 files changed, 328 insertions(+)
> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
> >
>
> I think it would be good to add selftests to make sure the bpf programs are working
> after fork/exec as intended.

Good suggestion. I will add a self test for it.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection
  2025-08-19  3:08     ` Yafang Shao
@ 2025-08-19 10:11       ` Usama Arif
  0 siblings, 0 replies; 17+ messages in thread
From: Usama Arif @ 2025-08-19 10:11 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, ameryhung, rientjes, bpf, linux-mm



On 19/08/2025 04:08, Yafang Shao wrote:
>> Hi Yafang,
>>
>> From the coverletter, one of the potential usecases you are trying to solve for is if global policy
>> is "never", but the workload want THPs (either always or on madvise basis). But over here,
>> MMF_VM_HUGEPAGE will never be set so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
>> always evaluate to false and the get_suggested_order call doesn't matter?
> 
> See the reply in another thread.
> 
>>
>>
>>
>>>               __khugepaged_enter(mm);
>>>  }
>>>
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 4108bcd96784..d10089e3f181 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
>>>
>>>         EXPERIMENTAL because the impact of some changes is still unclear.
>>>
>>> +config EXPERIMENTAL_BPF_ORDER_SELECTION
>>> +     bool "BPF-based THP order selection (EXPERIMENTAL)"
>>> +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
>>> +
>>> +     help
>>> +       Enable dynamic THP order selection using BPF programs. This
>>> +       experimental feature allows custom BPF logic to determine optimal
>>> +       transparent hugepage allocation sizes at runtime.
>>> +
>>> +       Warning: This feature is unstable and may change in future kernel
>>> +       versions.
>>> +
>>
>>
>> I know there was a discussion on this earlier, but my opinion is that putting all of this
>> as experiment with warnings is not great. No one will be able to deploy this in production
>> if its going to be removed, and I believe thats where the real usage is.
> 
> See the reply in another thread.
> 
>>
>>>  endif # TRANSPARENT_HUGEPAGE
>>>
>>>  # simple helper to make the code a bit easier to read
>>> diff --git a/mm/Makefile b/mm/Makefile
>>> index ef54aa615d9d..cb55d1509be1 100644
>>> --- a/mm/Makefile
>>> +++ b/mm/Makefile
>>> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>>  obj-$(CONFIG_NUMA) += memory-tiers.o
>>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>>>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>>> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
>>> new file mode 100644
>>> index 000000000000..2b03539452d1
>>> --- /dev/null
>>> +++ b/mm/bpf_thp.c
>>> @@ -0,0 +1,186 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +
>>> +#include <linux/bpf.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/huge_mm.h>
>>> +#include <linux/khugepaged.h>
>>> +
>>> +struct bpf_thp_ops {
>>> +     /**
>>> +      * @get_suggested_order: Get the suggested THP orders for allocation
>>> +      * @mm: mm_struct associated with the THP allocation
>>> +      * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>>> +      *                 When NULL, the decision should be based on @mm (i.e., when
>>> +      *                 triggered from an mm-scope hook rather than a VMA-specific
>>> +      *                 context).
>>> +      *                 Must belong to @mm (guaranteed by the caller).
>>> +      * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>>> +      * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>>> +      * @orders: Bitmask of requested THP orders for this allocation
>>> +      *          - PMD-mapped allocation if PMD_ORDER is set
>>> +      *          - mTHP allocation otherwise
>>> +      *
>>> +      * Return: Bitmask of suggested THP orders for allocation. The highest
>>> +      *         suggested order will not exceed the highest requested order
>>> +      *         in @orders.
>>
>> If we want to make this generic enough so that it doesnt change, should we allow suggested order to
>> exceed highest requested order?
> 
> The maximum requested order is determined by the callsite. For example:
> - PMD-mapped THP uses PMD_ORDER
> - mTHP uses (PMD_ORDER - 1)
> 
> We must respect this upper bound to avoid undefined behavior.

Ack, makes sense.
> 
>>
>>> +      */
>>> +     int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +                                u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>>> +};
>>> +
>>> +static struct bpf_thp_ops bpf_thp;
>>> +static DEFINE_SPINLOCK(thp_ops_lock);
>>> +
>>> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +                     u64 vma_flags, enum tva_type tva_flags, int orders)
>>> +{
>>> +     int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>>> +                                u64 vma_flags, enum tva_type tva_flags, int orders);
>>> +     int suggested_orders = orders;
>>> +
>>> +     /* No BPF program is attached */
>>> +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>>> +                   &transparent_hugepage_flags))
>>> +             return suggested_orders;
>>> +
>>> +     rcu_read_lock();
>>> +     bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
>>> +     if (!bpf_suggested_order)
>>> +             goto out;
>>
>>
>> My rcu API knowledge is not the best, but maybe we could do:
>>
>> if (!rcu_access_pointer(bpf_thp.get_suggested_order))
>>         return suggested_orders;
>>
> 
> There might be a race here.  The current rcu_access_pointer() check
> occurs outside the RCU read-side critical section, meaning the
> protected pointer could be freed between the check and use.
> Therefore, we must perform the NULL check within the RCU read critical
> section when dereferencing the pointer:
> 

Ack, makes sense.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection
  2025-08-19  2:41   ` Yafang Shao
@ 2025-08-19 10:44     ` Usama Arif
  2025-08-19 11:33       ` Yafang Shao
  0 siblings, 1 reply; 17+ messages in thread
From: Usama Arif @ 2025-08-19 10:44 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, ameryhung, rientjes, bpf, linux-mm



On 19/08/2025 03:41, Yafang Shao wrote:
> On Mon, Aug 18, 2025 at 10:35 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>>
>>
>> On 18/08/2025 06:55, Yafang Shao wrote:
>>> Background
>>> ----------
>>>
>>> Our production servers consistently configure THP to "never" due to
>>> historical incidents caused by its behavior. Key issues include:
>>> - Increased Memory Consumption
>>>   THP significantly raises overall memory usage, reducing available memory
>>>   for workloads.
>>>
>>> - Latency Spikes
>>>   Random latency spikes occur due to frequent memory compaction triggered
>>>   by THP.
>>>
>>> - Lack of Fine-Grained Control
>>>   THP tuning is globally configured, making it unsuitable for containerized
>>>   environments. When multiple workloads share a host, enabling THP without
>>>   per-workload control leads to unpredictable behavior.
>>>
>>> Due to these issues, administrators avoid switching to madvise or always
>>> modes—unless per-workload THP control is implemented.
>>>
>>> To address this, we propose BPF-based THP policy for flexible adjustment.
>>> Additionally, as David mentioned [0], this mechanism can also serve as a
>>> policy prototyping tool (test policies via BPF before upstreaming them).
>>
>> Hi Yafang,
>>
>> A few points:
>>
>> The link [0] is mentioned a couple of times in the coverletter, but it doesn't seem
>> to be anywhere in the coverletter.
> 
> Oops, my bad.
> 
>>
>> I am probably missing something over here, but the current version won't accomplish
>> the usecase you have described at the start of the coverletter and are aiming for, right?
>> i.e. THP global policy "never", but get hugepages on an madvise or always basis.
> 
> In "never" mode, THP allocation is entirely disabled (except via
> MADV_COLLAPSE). However, we can achieve the same behavior—and
> more—using a BPF program, even in "madvise" or "always" mode. Instead
> of introducing a new THP mode, we dynamically enforce policy via BPF.
> 
> Deployment Steps in our production servers:
> 
> 1. Initial Setup:
> - Set THP mode to "never" (disabling THP by default).
> - Attach the BPF program and pin the BPF maps and links.
> - Pinning ensures persistence (like a kernel module), preventing
> disruption under system pressure.
> - A THP whitelist map tracks allowed cgroups (initially empty → no THP
> allocations).
> 
> 2. Enable THP Control:
> - Switch THP mode to "always" or "madvise" (BPF now governs actual allocations).


Ah ok, so I was missing this part. With this solution you will still have to change
the system policy to madvise or always, and then basically disable THP for everyone apart
from the cgroups that want it?

> 
> 3. Dynamic Management:
> - To permit THP for a cgroup, add its ID to the whitelist map.
> - To revoke permission, remove the cgroup ID from the map.
> - The BPF program can be updated live (policy adjustments require no
> task interruption).
> 
>> I think there was a new THP mode introduced in some earlier revision where you can switch to it
>> from "never" and then you can use bpf programs with it, but its not in this revision?
>> It might be useful to add your specific usecase as a selftest.
>>
>> Do we have some numbers on what the overhead of calling the bpf program is in the
>> pagefault path as its a critical path?
> 
> In our current implementation, THP allocation occurs during the page
> fault path. As such, I have not yet evaluated performance for this
> specific case.
> The overhead is expected to be workload-dependent, primarily influenced by:
> - Memory availability: The presence (or absence) of higher-order free pages
> - System pressure: Contention for memory compaction, NUMA balancing,
> or direct reclaim
> 

Yes, I think it might be worth seeing if perf indicates that you are spending more time
in __handle_mm_fault with this series + bpf program attached compared to without?

>>
>> I remember there was a discussion on this in the earlier revisions, and I have mentioned this in patch 1
>> as well, but I think making this feature experimental with warnings might not be a great idea.
> 
> The experimental status of this feature was requested by David and
> Lorenzo, who likely have specific technical considerations behind this
> requirement.
> 
>> It could lead to 2 paths:
>> - people don't deploy this in their fleet because its marked as experimental and they dont want
>> their machines to break once they upgrade the kernel and this is changed. We will have a difficult
>> time improving upon this as this is just going to be used for prototyping and won't be driven by
>> production data.
>> - people are careless and deploy it in on their production machines, and you get reports that this
>> has broken after kernel upgrades (despite being marked as experimental :)).
>> This is just my opinion (which can be wrong :)), but I think we should try and have this merged
>> as a stable interface that won't change. There might be bugs reported down the line, but I am hoping
>> we can get the interface of get_suggested_order right in the first implementation that gets merged?
> 
> We may eventually remove the experimental status or deprecate this
> feature entirely, depending on its adoption. However, the first
> critical step is to make it available for broader usage and
> evaluation.
> 



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection
  2025-08-18 13:17   ` Usama Arif
  2025-08-19  3:08     ` Yafang Shao
@ 2025-08-19 11:10     ` Gutierrez Asier
  2025-08-19 11:43       ` Yafang Shao
  1 sibling, 1 reply; 17+ messages in thread
From: Gutierrez Asier @ 2025-08-19 11:10 UTC (permalink / raw)
  To: Usama Arif, Yafang Shao, akpm, david, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hannes, willy, ast, daniel, andrii, ameryhung, rientjes
  Cc: bpf, linux-mm

Hi,

On 8/18/2025 4:17 PM, Usama Arif wrote:
> 
> 
> On 18/08/2025 06:55, Yafang Shao wrote:
>> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
>> THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
>> programs to influence THP order selection based on factors such as:
>> - Workload identity
>>   For example, workloads running in specific containers or cgroups.
>> - Allocation context
>>   Whether the allocation occurs during a page fault, khugepaged, or other
>>   paths.
>> - System memory pressure
>>   (May require new BPF helpers to accurately assess memory pressure.)
>>
>> Key Details:
>> - Only one BPF program can be attached at a time, but it can be updated
>>   dynamically to adjust the policy.
>> - Supports automatic mTHP order selection and per-workload THP policies.
>> - Only functional when THP is set to madvise or always.
>>
>> It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
>> This feature is unstable and may evolve in future kernel versions.
>>
>> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
>> Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
>>
>> Suggested-by: David Hildenbrand <david@redhat.com>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>> ---
>>  include/linux/huge_mm.h    |  15 +++
>>  include/linux/khugepaged.h |  12 ++-
>>  mm/Kconfig                 |  12 +++
>>  mm/Makefile                |   1 +
>>  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
>>  mm/huge_memory.c           |  10 ++
>>  mm/khugepaged.c            |  26 +++++-
>>  mm/memory.c                |  18 +++-
>>  8 files changed, 273 insertions(+), 7 deletions(-)
>>  create mode 100644 mm/bpf_thp.c
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 1ac0d06fb3c1..f0c91d7bd267 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -6,6 +6,8 @@
>>  
>>  #include <linux/fs.h> /* only for vma_is_dax() */
>>  #include <linux/kobject.h>
>> +#include <linux/pgtable.h>
>> +#include <linux/mm.h>
>>  
>>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
>>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>> @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
>>  	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>>  	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>>  	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
>> +	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
>>  };
>>  
>>  struct kobject;
>> @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
>>  			(1<<TRANSPARENT_HUGEPAGE_FLAG);
>>  }
>>  
>> +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
>> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>> +			u64 vma_flags, enum tva_type tva_flags, int orders);
>> +#else
>> +static inline int
>> +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>> +		    u64 vma_flags, enum tva_type tva_flags, int orders)
>> +{
>> +	return orders;
>> +}
>> +#endif
>> +
>>  static inline int highest_order(unsigned long orders)
>>  {
>>  	return fls_long(orders) - 1;
>> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
>> index eb1946a70cff..d81c1228a21f 100644
>> --- a/include/linux/khugepaged.h
>> +++ b/include/linux/khugepaged.h
>> @@ -4,6 +4,8 @@
>>  
>>  #include <linux/mm.h>
>>  
>> +#include <linux/huge_mm.h>
>> +
>>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  extern struct attribute_group khugepaged_attr_group;
>> @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>  
>>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>>  {
>> -	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
>> +	/*
>> +	 * THP allocation policy can be dynamically modified via BPF. Even if a
>> +	 * task was allowed to allocate THPs, BPF can decide whether its forked
>> +	 * child can allocate THPs.
>> +	 *
>> +	 * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
>> +	 */
>> +	if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
>> +		get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> 
> Hi Yafang,
> 
> From the coverletter, one of the potential usecases you are trying to solve for is if global policy
> is "never", but the workload want THPs (either always or on madvise basis). But over here,
> MMF_VM_HUGEPAGE will never be set so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
> always evaluate to false and the get_suggested_order call doesn't matter?
> 
> 
> 
>>  		__khugepaged_enter(mm);
>>  }
>>  
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 4108bcd96784..d10089e3f181 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
>>  
>>  	  EXPERIMENTAL because the impact of some changes is still unclear.
>>  
>> +config EXPERIMENTAL_BPF_ORDER_SELECTION
>> +	bool "BPF-based THP order selection (EXPERIMENTAL)"
>> +	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
>> +
>> +	help
>> +	  Enable dynamic THP order selection using BPF programs. This
>> +	  experimental feature allows custom BPF logic to determine optimal
>> +	  transparent hugepage allocation sizes at runtime.
>> +
>> +	  Warning: This feature is unstable and may change in future kernel
>> +	  versions.
>> +
> 
> 
> I know there was a discussion on this earlier, but my opinion is that putting all of this
> as experiment with warnings is not great. No one will be able to deploy this in production
> if its going to be removed, and I believe thats where the real usage is.
> 
If the goal is to deploy it in Kubernetes, I believe eBPF is the wrong way to do it. Right 
now eBPF is used mainly for networking (CNI).

Kubernetes currently has something called Dynamic Resource Allocation (DRA), which is already 
in alpha. Its main use is to share GPUs and TPUs among many pods. Still, we should take into 
account how likely user space is to use eBPF for controlling resources and how it integrates 
with the mechanisms currently available for resource control from user space.

There is another scenario, where you have a number of pods and a limit of huge pages you 
want to share among them, similar to HugeTLBfs. Could this be achieved with your 
eBPF implementation?

>>  endif # TRANSPARENT_HUGEPAGE
>>  
>>  # simple helper to make the code a bit easier to read
>> diff --git a/mm/Makefile b/mm/Makefile
>> index ef54aa615d9d..cb55d1509be1 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>>  obj-$(CONFIG_NUMA) += memory-tiers.o
>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
>> new file mode 100644
>> index 000000000000..2b03539452d1
>> --- /dev/null
>> +++ b/mm/bpf_thp.c
>> @@ -0,0 +1,186 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +#include <linux/bpf.h>
>> +#include <linux/btf.h>
>> +#include <linux/huge_mm.h>
>> +#include <linux/khugepaged.h>
>> +
>> +struct bpf_thp_ops {
>> +	/**
>> +	 * @get_suggested_order: Get the suggested THP orders for allocation
>> +	 * @mm: mm_struct associated with the THP allocation
>> +	 * @vma__nullable: vm_area_struct associated with the THP allocation (may be NULL)
>> +	 *                 When NULL, the decision should be based on @mm (i.e., when
>> +	 *                 triggered from an mm-scope hook rather than a VMA-specific
>> +	 *                 context).
>> +	 *                 Must belong to @mm (guaranteed by the caller).
>> +	 * @vma_flags: use these vm_flags instead of @vma->vm_flags (0 if @vma is NULL)
>> +	 * @tva_flags: TVA flags for current @vma (-1 if @vma is NULL)
>> +	 * @orders: Bitmask of requested THP orders for this allocation
>> +	 *          - PMD-mapped allocation if PMD_ORDER is set
>> +	 *          - mTHP allocation otherwise
>> +	 *
>> +	 * Return: Bitmask of suggested THP orders for allocation. The highest
>> +	 *         suggested order will not exceed the highest requested order
>> +	 *         in @orders.
> 
> If we want to make this generic enough so that it doesnt change, should we allow suggested order to
> exceed highest requested order?
> 
>> +	 */
>> +	int (*get_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>> +				   u64 vma_flags, enum tva_type tva_flags, int orders) __rcu;
>> +};
>> +
>> +static struct bpf_thp_ops bpf_thp;
>> +static DEFINE_SPINLOCK(thp_ops_lock);
>> +
>> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>> +			u64 vma_flags, enum tva_type tva_flags, int orders)
>> +{
>> +	int (*bpf_suggested_order)(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>> +				   u64 vma_flags, enum tva_type tva_flags, int orders);
>> +	int suggested_orders = orders;
>> +
>> +	/* No BPF program is attached */
>> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>> +		      &transparent_hugepage_flags))
>> +		return suggested_orders;
>> +
>> +	rcu_read_lock();
>> +	bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
>> +	if (!bpf_suggested_order)
>> +		goto out;
> 
> 
> My rcu API knowledge is not the best, but maybe we could do:
> 
> if (!rcu_access_pointer(bpf_thp.get_suggested_order))
> 	return suggested_orders;
> 
> rcu_read_lock();
> bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
> 
> I believe this might be better as you dont acquire the rcu_read_lock and avoid
> the lockdep checks when you do rcu_access_pointer, but I might be wrong
> and hope you or someone on the mailing list corrects me if thats the case :)
> 
>> +
>> +	suggested_orders = bpf_suggested_order(mm, vma__nullable, vma_flags, tva_flags, orders);
>> +	if (highest_order(suggested_orders) > highest_order(orders))
>> +		suggested_orders = orders;
>> +
> 
> Maybe we should allow suggested_order to be greater than order if we want this to be truly generic?
> Not a suggestion to change, just to have a discussion.
> 
>> +out:
>> +	rcu_read_unlock();
>> +	return suggested_orders;
>> +}
>> +
>> +static bool bpf_thp_ops_is_valid_access(int off, int size,
>> +					enum bpf_access_type type,
>> +					const struct bpf_prog *prog,
>> +					struct bpf_insn_access_aux *info)
>> +{
>> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
>> +}
>> +
>> +static const struct bpf_func_proto *
>> +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> +{
>> +	return bpf_base_func_proto(func_id, prog);
>> +}
>> +
>> +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
>> +	.get_func_proto = bpf_thp_get_func_proto,
>> +	.is_valid_access = bpf_thp_ops_is_valid_access,
>> +};
>> +
>> +static int bpf_thp_init(struct btf *btf)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int bpf_thp_init_member(const struct btf_type *t,
>> +			       const struct btf_member *member,
>> +			       void *kdata, const void *udata)
>> +{
>> +	return 0;
>> +}
>> +
>> +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
>> +{
>> +	struct bpf_thp_ops *ops = kdata;
>> +
>> +	spin_lock(&thp_ops_lock);
>> +	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
>> +		&transparent_hugepage_flags)) {
>> +		spin_unlock(&thp_ops_lock);
>> +		return -EBUSY;
>> +	}
>> +	WARN_ON_ONCE(bpf_thp.get_suggested_order);
> 
> Should it be WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order)) ?
> 
>> +	WRITE_ONCE(bpf_thp.get_suggested_order, ops->get_suggested_order);
> 
> Should it be rcu_assign_pointer instead of WRITE_ONCE?
> 
>> +	spin_unlock(&thp_ops_lock);
>> +	return 0;
>> +}
>> +
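Taking the two suggestions just above together (rcu_access_pointer() for the
read-only check, rcu_assign_pointer() instead of WRITE_ONCE() for publishing),
the registration path might end up looking roughly like this; an untested
sketch, not a tested replacement:

static int bpf_thp_reg(void *kdata, struct bpf_link *link)
{
	struct bpf_thp_ops *ops = kdata;

	spin_lock(&thp_ops_lock);
	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
			     &transparent_hugepage_flags)) {
		spin_unlock(&thp_ops_lock);
		return -EBUSY;
	}
	/* Read-only peek at the __rcu pointer; no rcu_read_lock() needed. */
	WARN_ON_ONCE(rcu_access_pointer(bpf_thp.get_suggested_order));
	/* Publish the callback so readers pairing with rcu_dereference() see it. */
	rcu_assign_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order);
	spin_unlock(&thp_ops_lock);
	return 0;
}
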
>> +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
>> +{
>> +	spin_lock(&thp_ops_lock);
>> +	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
>> +	WARN_ON_ONCE(!bpf_thp.get_suggested_order);
> 
> Maybe we need to use rcu_access_pointer() here in the warning?
> 
>> +	rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
>> +	spin_unlock(&thp_ops_lock);
>> +
>> +	synchronize_rcu();
>> +}
>> +
>> +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
>> +{
>> +	struct bpf_thp_ops *ops = kdata;
>> +	struct bpf_thp_ops *old = old_kdata;
>> +	int ret = 0;
>> +
>> +	if (!ops || !old)
>> +		return -EINVAL;
>> +
>> +	spin_lock(&thp_ops_lock);
>> +	/* The prog has already been removed. */
>> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags)) {
>> +		ret = -ENOENT;
>> +		goto out;
>> +	}
>> +	WARN_ON_ONCE(!bpf_thp.get_suggested_order);
> 
> As above about using rcu_access_pointer.
> 
>> +	rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
>> +			    lockdep_is_held(&thp_ops_lock));
>> +
>> +out:
>> +	spin_unlock(&thp_ops_lock);
>> +	if (!ret)
>> +		synchronize_rcu();
>> +	return ret;
>> +}
>> +
>> +static int bpf_thp_validate(void *kdata)
>> +{
>> +	struct bpf_thp_ops *ops = kdata;
>> +
>> +	if (!ops->get_suggested_order) {
>> +		pr_err("bpf_thp: required ops isn't implemented\n");
>> +		return -EINVAL;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
>> +			   u64 vma_flags, enum tva_type vm_flags, int orders)
>> +{
>> +	return orders;
>> +}
>> +
>> +static struct bpf_thp_ops __bpf_thp_ops = {
>> +	.get_suggested_order = suggested_order,
>> +};
>> +
>> +static struct bpf_struct_ops bpf_bpf_thp_ops = {
>> +	.verifier_ops = &thp_bpf_verifier_ops,
>> +	.init = bpf_thp_init,
>> +	.init_member = bpf_thp_init_member,
>> +	.reg = bpf_thp_reg,
>> +	.unreg = bpf_thp_unreg,
>> +	.update = bpf_thp_update,
>> +	.validate = bpf_thp_validate,
>> +	.cfi_stubs = &__bpf_thp_ops,
>> +	.owner = THIS_MODULE,
>> +	.name = "bpf_thp_ops",
>> +};
>> +
>> +static int __init bpf_thp_ops_init(void)
>> +{
>> +	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
>> +
>> +	if (err)
>> +		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
>> +	return err;
>> +}
>> +late_initcall(bpf_thp_ops_init);
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d89992b65acc..bd8f8f34ab3c 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1349,6 +1349,16 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>>  		return ret;
>>  	khugepaged_enter_vma(vma, vma->vm_flags);
>>  
>> +	/*
>> +	 * This check must occur after khugepaged_enter_vma() because:
>> +	 * 1. We may permit THP allocation via khugepaged
>> +	 * 2. While simultaneously disallowing THP allocation
>> +	 *    during page fault handling
>> +	 */
>> +	if (get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_PAGEFAULT, BIT(PMD_ORDER)) !=
>> +				BIT(PMD_ORDER))
>> +		return VM_FAULT_FALLBACK;
>> +
>>  	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>>  			!mm_forbids_zeropage(vma->vm_mm) &&
>>  			transparent_hugepage_use_zero_page()) {
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index d3d4f116e14b..935583626db6 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -474,7 +474,9 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>>  {
>>  	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>>  	    hugepage_pmd_enabled()) {
>> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
>> +		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER) &&
>> +		    get_suggested_order(vma->vm_mm, vma, vm_flags, TVA_KHUGEPAGED,
>> +					BIT(PMD_ORDER)))
>>  			__khugepaged_enter(vma->vm_mm);
>>  	}
>>  }
>> @@ -934,6 +936,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
>>  		return SCAN_ADDRESS_RANGE;
>>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, type, PMD_ORDER))
>>  		return SCAN_VMA_CHECK;
>> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, type, BIT(PMD_ORDER)))
>> +		return SCAN_VMA_CHECK;
>>  	/*
>>  	 * Anon VMA expected, the address may be unmapped then
>>  	 * remapped to file after khugepaged reaquired the mmap_lock.
>> @@ -1465,6 +1469,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>>  		/* khugepaged_mm_lock actually not necessary for the below */
>>  		mm_slot_free(mm_slot_cache, mm_slot);
>>  		mmdrop(mm);
>> +	} else if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER))) {
>> +		hash_del(&slot->hash);
>> +		list_del(&slot->mm_node);
>> +		mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>> +		mm_slot_free(mm_slot_cache, mm_slot);
>>  	}
>>  }
>>  
>> @@ -1538,6 +1547,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
>>  		return SCAN_VMA_CHECK;
>>  
>> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
>> +				 BIT(PMD_ORDER)))
>> +		return SCAN_VMA_CHECK;
>>  	/* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */
>>  	if (userfaultfd_wp(vma))
>>  		return SCAN_PTE_UFFD_WP;
>> @@ -2416,6 +2428,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>>  	 * the next mm on the list.
>>  	 */
>>  	vma = NULL;
>> +
>> +	/* If this mm is not suitable for the scan list, we should remove it. */
>> +	if (!get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
>> +		goto breakouterloop_mmap_lock;
>>  	if (unlikely(!mmap_read_trylock(mm)))
>>  		goto breakouterloop_mmap_lock;
>>  
>> @@ -2432,7 +2448,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>>  			progress++;
>>  			break;
>>  		}
>> -		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER)) {
>> +		if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_KHUGEPAGED, PMD_ORDER) ||
>> +		    !get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_KHUGEPAGED,
>> +					 BIT(PMD_ORDER))) {
>>  skip:
>>  			progress++;
>>  			continue;
>> @@ -2769,6 +2787,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>  	if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD_ORDER))
>>  		return -EINVAL;
>>  
>> +	if (!get_suggested_order(vma->vm_mm, vma, vma->vm_flags, TVA_FORCED_COLLAPSE,
>> +				 BIT(PMD_ORDER)))
>> +		return -EINVAL;
>> +
>>  	cc = kmalloc(sizeof(*cc), GFP_KERNEL);
>>  	if (!cc)
>>  		return -ENOMEM;
>> diff --git a/mm/memory.c b/mm/memory.c
>> index d9de6c056179..0178857aa058 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4486,6 +4486,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
>>  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>  {
>>  	struct vm_area_struct *vma = vmf->vma;
>> +	int order, suggested_orders;
>>  	unsigned long orders;
>>  	struct folio *folio;
>>  	unsigned long addr;
>> @@ -4493,7 +4494,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>  	spinlock_t *ptl;
>>  	pte_t *pte;
>>  	gfp_t gfp;
>> -	int order;
>>  
>>  	/*
>>  	 * If uffd is active for the vma we need per-page fault fidelity to
>> @@ -4510,13 +4510,18 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>  	if (!zswap_never_enabled())
>>  		goto fallback;
>>  
>> +	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
>> +					       TVA_PAGEFAULT,
>> +					       BIT(PMD_ORDER) - 1);
>> +	if (!suggested_orders)
>> +		goto fallback;
>>  	entry = pte_to_swp_entry(vmf->orig_pte);
>>  	/*
>>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>  	 * and suitable for swapping THP.
>>  	 */
>>  	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
>> -					  BIT(PMD_ORDER) - 1);
>> +					  suggested_orders);
>>  	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>  	orders = thp_swap_suitable_orders(swp_offset(entry),
>>  					  vmf->address, orders);
>> @@ -5044,12 +5049,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>  {
>>  	struct vm_area_struct *vma = vmf->vma;
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	int order, suggested_orders;
>>  	unsigned long orders;
>>  	struct folio *folio;
>>  	unsigned long addr;
>>  	pte_t *pte;
>>  	gfp_t gfp;
>> -	int order;
>>  
>>  	/*
>>  	 * If uffd is active for the vma we need per-page fault fidelity to
>> @@ -5058,13 +5063,18 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>  	if (unlikely(userfaultfd_armed(vma)))
>>  		goto fallback;
>>  
>> +	suggested_orders = get_suggested_order(vma->vm_mm, vma, vma->vm_flags,
>> +					       TVA_PAGEFAULT,
>> +					       BIT(PMD_ORDER) - 1);
>> +	if (!suggested_orders)
>> +		goto fallback;
>>  	/*
>>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>  	 * for this vma. Then filter out the orders that can't be allocated over
>>  	 * the faulting address and still be fully contained in the vma.
>>  	 */
>>  	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
>> -					  BIT(PMD_ORDER) - 1);
>> +					  suggested_orders);
>>  	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>>  
>>  	if (!orders)
> 
> 

-- 
Asier Gutierrez
Huawei



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection
  2025-08-19 10:44     ` Usama Arif
@ 2025-08-19 11:33       ` Yafang Shao
  0 siblings, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2025-08-19 11:33 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, ameryhung, rientjes, bpf, linux-mm

On Tue, Aug 19, 2025 at 6:44 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 19/08/2025 03:41, Yafang Shao wrote:
> > On Mon, Aug 18, 2025 at 10:35 PM Usama Arif <usamaarif642@gmail.com> wrote:
> >>
> >>
> >>
> >> On 18/08/2025 06:55, Yafang Shao wrote:
> >>> Background
> >>> ----------
> >>>
> >>> Our production servers consistently configure THP to "never" due to
> >>> historical incidents caused by its behavior. Key issues include:
> >>> - Increased Memory Consumption
> >>>   THP significantly raises overall memory usage, reducing available memory
> >>>   for workloads.
> >>>
> >>> - Latency Spikes
> >>>   Random latency spikes occur due to frequent memory compaction triggered
> >>>   by THP.
> >>>
> >>> - Lack of Fine-Grained Control
> >>>   THP tuning is globally configured, making it unsuitable for containerized
> >>>   environments. When multiple workloads share a host, enabling THP without
> >>>   per-workload control leads to unpredictable behavior.
> >>>
> >>> Due to these issues, administrators avoid switching to madvise or always
> >>> modes—unless per-workload THP control is implemented.
> >>>
> >>> To address this, we propose BPF-based THP policy for flexible adjustment.
> >>> Additionally, as David mentioned [0], this mechanism can also serve as a
> >>> policy prototyping tool (test policies via BPF before upstreaming them).
> >>
> >> Hi Yafang,
> >>
> >> A few points:
> >>
> >> The reference [0] is mentioned a couple of times in the cover letter, but the
> >> link itself doesn't seem to be defined anywhere.
> >
> > Oops, my bad.
> >
> >>
> >> I am probably missing something here, but the current version won't accomplish the
> >> use case you describe at the start of the cover letter and are aiming for, right?
> >> i.e. THP global policy "never", but get hugepages on a madvise or always basis.
> >
> > In "never" mode, THP allocation is entirely disabled (except via
> > MADV_COLLAPSE). However, we can achieve the same behavior—and
> > more—using a BPF program, even in "madvise" or "always" mode. Instead
> > of introducing a new THP mode, we dynamically enforce policy via BPF.
> >
> > Deployment Steps in our production servers:
> >
> > 1. Initial Setup:
> > - Set THP mode to "never" (disabling THP by default).
> > - Attach the BPF program and pin the BPF maps and links.
> > - Pinning ensures persistence (like a kernel module), preventing
> > disruption under system pressure.
> > - A THP whitelist map tracks allowed cgroups (initially empty → no THP
> > allocations).
> >
> > 2. Enable THP Control:
> > - Switch THP mode to "always" or "madvise" (BPF now governs actual allocations).
>
>
> Ah ok, so I was missing this part. With this solution you will still have to change
> the system policy to madvise or always, and then basically disable THP for everyone apart
> from the cgroups that want it?

Right.
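
For reference, a minimal sketch of such a policy program (illustrative only:
the map layout, section name, and the availability of
bpf_get_current_cgroup_id() to this struct_ops program are assumptions, not
the exact production program):

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* cgroup ids allowed to get THP; an empty map means no THP at all. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u64);	/* cgroup id */
	__type(value, __u32);	/* value unused, presence == allowed */
} thp_whitelist SEC(".maps");

SEC("struct_ops/get_suggested_order")
int BPF_PROG(suggested_order, struct mm_struct *mm,
	     struct vm_area_struct *vma, u64 vma_flags,
	     enum tva_type tva_flags, int orders)
{
	__u64 cgid = bpf_get_current_cgroup_id();

	/* Suppress all THP orders unless the cgroup is whitelisted. */
	if (!bpf_map_lookup_elem(&thp_whitelist, &cgid))
		return 0;
	return orders;
}

SEC(".struct_ops.link")
struct bpf_thp_ops thp_policy = {
	.get_suggested_order = (void *)suggested_order,
};

char LICENSE[] SEC("license") = "GPL";

The loader pins the link and the map (e.g. via bpf_map__attach_struct_ops()
and bpf_link__pin()) so the policy survives the loading process exiting, and
the whitelist map is then updated at runtime to grant or revoke THP per cgroup.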

>
> >
> > 3. Dynamic Management:
> > - To permit THP for a cgroup, add its ID to the whitelist map.
> > - To revoke permission, remove the cgroup ID from the map.
> > - The BPF program can be updated live (policy adjustments require no
> > task interruption).
> >
> >> I think there was a new THP mode introduced in some earlier revision where you can switch to it
> >> from "never" and then you can use bpf programs with it, but its not in this revision?
> >> It might be useful to add your specific usecase as a selftest.
> >>
> >> Do we have some numbers on what the overhead of calling the bpf program is in the
> >> pagefault path as its a critical path?
> >
> > In our current implementation, THP allocation occurs during the page
> > fault path. As such, I have not yet evaluated performance for this
> > specific case.
> > The overhead is expected to be workload-dependent, primarily influenced by:
> > - Memory availability: The presence (or absence) of higher-order free pages
> > - System pressure: Contention for memory compaction, NUMA balancing,
> > or direct reclaim
> >
>
> Yes, I think it might be worth seeing if perf indicates that you are spending more time
> in __handle_mm_fault with this series + bpf program attached compared to without?

I will test it.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC PATCH v5 mm-new 1/5] mm: thp: add support for BPF based THP order selection
  2025-08-19 11:10     ` Gutierrez Asier
@ 2025-08-19 11:43       ` Yafang Shao
  0 siblings, 0 replies; 17+ messages in thread
From: Yafang Shao @ 2025-08-19 11:43 UTC (permalink / raw)
  To: Gutierrez Asier
  Cc: Usama Arif, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes, willy, ast,
	daniel, andrii, ameryhung, rientjes, bpf, linux-mm

On Tue, Aug 19, 2025 at 7:11 PM Gutierrez Asier
<gutierrez.asier@huawei-partners.com> wrote:
>
> Hi,
>
> On 8/18/2025 4:17 PM, Usama Arif wrote:
> >
> >
> > On 18/08/2025 06:55, Yafang Shao wrote:
> >> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> >> THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> >> programs to influence THP order selection based on factors such as:
> >> - Workload identity
> >>   For example, workloads running in specific containers or cgroups.
> >> - Allocation context
> >>   Whether the allocation occurs during a page fault, khugepaged, or other
> >>   paths.
> >> - System memory pressure
> >>   (May require new BPF helpers to accurately assess memory pressure.)
> >>
> >> Key Details:
> >> - Only one BPF program can be attached at a time, but it can be updated
> >>   dynamically to adjust the policy.
> >> - Supports automatic mTHP order selection and per-workload THP policies.
> >> - Only functional when THP is set to madvise or always.
> >>
> >> It requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> >> This feature is unstable and may evolve in future kernel versions.
> >>
> >> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> >> Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> >>
> >> Suggested-by: David Hildenbrand <david@redhat.com>
> >> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >> ---
> >>  include/linux/huge_mm.h    |  15 +++
> >>  include/linux/khugepaged.h |  12 ++-
> >>  mm/Kconfig                 |  12 +++
> >>  mm/Makefile                |   1 +
> >>  mm/bpf_thp.c               | 186 +++++++++++++++++++++++++++++++++++++
> >>  mm/huge_memory.c           |  10 ++
> >>  mm/khugepaged.c            |  26 +++++-
> >>  mm/memory.c                |  18 +++-
> >>  8 files changed, 273 insertions(+), 7 deletions(-)
> >>  create mode 100644 mm/bpf_thp.c
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index 1ac0d06fb3c1..f0c91d7bd267 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -6,6 +6,8 @@
> >>
> >>  #include <linux/fs.h> /* only for vma_is_dax() */
> >>  #include <linux/kobject.h>
> >> +#include <linux/pgtable.h>
> >> +#include <linux/mm.h>
> >>
> >>  vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> >>  int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >> @@ -56,6 +58,7 @@ enum transparent_hugepage_flag {
> >>      TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> >>      TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> >>      TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> >> +    TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> >>  };
> >>
> >>  struct kobject;
> >> @@ -195,6 +198,18 @@ static inline bool hugepage_global_always(void)
> >>                      (1<<TRANSPARENT_HUGEPAGE_FLAG);
> >>  }
> >>
> >> +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> >> +int get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> >> +                    u64 vma_flags, enum tva_type tva_flags, int orders);
> >> +#else
> >> +static inline int
> >> +get_suggested_order(struct mm_struct *mm, struct vm_area_struct *vma__nullable,
> >> +                u64 vma_flags, enum tva_type tva_flags, int orders)
> >> +{
> >> +    return orders;
> >> +}
> >> +#endif
> >> +
> >>  static inline int highest_order(unsigned long orders)
> >>  {
> >>      return fls_long(orders) - 1;
> >> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> >> index eb1946a70cff..d81c1228a21f 100644
> >> --- a/include/linux/khugepaged.h
> >> +++ b/include/linux/khugepaged.h
> >> @@ -4,6 +4,8 @@
> >>
> >>  #include <linux/mm.h>
> >>
> >> +#include <linux/huge_mm.h>
> >> +
> >>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>  extern struct attribute_group khugepaged_attr_group;
> >> @@ -22,7 +24,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >>
> >>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >>  {
> >> -    if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm))
> >> +    /*
> >> +     * THP allocation policy can be dynamically modified via BPF. Even if a
> >> +     * task was allowed to allocate THPs, BPF can decide whether its forked
> >> +     * child can allocate THPs.
> >> +     *
> >> +     * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> >> +     */
> >> +    if (mm_flags_test(MMF_VM_HUGEPAGE, oldmm) &&
> >> +            get_suggested_order(mm, NULL, 0, -1, BIT(PMD_ORDER)))
> >
> > Hi Yafang,
> >
> > From the cover letter, one of the potential use cases you are trying to solve for is if the global policy
> > is "never", but the workload wants THPs (either always or on a madvise basis). But here,
> > MMF_VM_HUGEPAGE will never be set, so in that case mm_flags_test(MMF_VM_HUGEPAGE, oldmm) will
> > always evaluate to false and the get_suggested_order() call doesn't matter?
> >
> >
> >
> >>              __khugepaged_enter(mm);
> >>  }
> >>
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index 4108bcd96784..d10089e3f181 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -924,6 +924,18 @@ config NO_PAGE_MAPCOUNT
> >>
> >>        EXPERIMENTAL because the impact of some changes is still unclear.
> >>
> >> +config EXPERIMENTAL_BPF_ORDER_SELECTION
> >> +    bool "BPF-based THP order selection (EXPERIMENTAL)"
> >> +    depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> >> +
> >> +    help
> >> +      Enable dynamic THP order selection using BPF programs. This
> >> +      experimental feature allows custom BPF logic to determine optimal
> >> +      transparent hugepage allocation sizes at runtime.
> >> +
> >> +      Warning: This feature is unstable and may change in future kernel
> >> +      versions.
> >> +
> >
> >
> > I know there was a discussion on this earlier, but my opinion is that putting all of this
> > out as experimental with warnings is not great. No one will be able to deploy this in production
> > if it's going to be removed, and I believe that's where the real usage is.
> >
> If the goal is to deploy it in Kubernetes, I believe eBPF is the wrong way to do it. Right
> now eBPF is used mainly for networking (CNI).

As I recall, I've already shared the Kubernetes deployment procedure
with you. [0]
If you’re using k8s, you should definitely check this out.
JFYI, we have already deployed this in our Kubernetes production environment.

[0] https://lore.kernel.org/linux-mm/CALOAHbDJPP499ZDitUYqThAJ_BmpeWN_NVR-wm=8XBe3X7Wxkw@mail.gmail.com/

>
> Kubernetes currently has something called Dynamic Resource Allocation (DRA), which is already
> in alpha. Its main use is to share GPUs and TPUs among many pods. Still, we should
> take into account how likely user space is to use eBPF for controlling resources and
> how it integrates with the mechanisms currently available to user space for resource
> control.

This is unrelated to the current feature.

>
> There is another scenario, where you have a number of pods and a limit of huge pages you
> want to share among them. Something similar to HugeTLBfs. Could this be achieved with your
> ebpf implementation?

This feature focuses on policy adjustment rather than resource control.

-- 
Regards
Yafang


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2025-08-18  5:55 [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 1/5] mm: thp: add support for " Yafang Shao
2025-08-18 13:17   ` Usama Arif
2025-08-19  3:08     ` Yafang Shao
2025-08-19 10:11       ` Usama Arif
2025-08-19 11:10     ` Gutierrez Asier
2025-08-19 11:43       ` Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 2/5] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 3/5] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 4/5] bpf: mark vma->vm_mm as trusted Yafang Shao
2025-08-18  5:55 ` [RFC PATCH v5 mm-new 5/5] selftest/bpf: add selftest for BPF based THP order seletection Yafang Shao
2025-08-18 14:00   ` Usama Arif
2025-08-19  3:09     ` Yafang Shao
2025-08-18 14:35 ` [RFC PATCH v5 mm-new 0/5] mm, bpf: BPF based THP order selection Usama Arif
2025-08-19  2:41   ` Yafang Shao
2025-08-19 10:44     ` Usama Arif
2025-08-19 11:33       ` Yafang Shao
