* [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection
@ 2025-07-29 9:18 Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 1/4] mm: thp: add support for " Yafang Shao
` (4 more replies)
0 siblings, 5 replies; 13+ messages in thread
From: Yafang Shao @ 2025-07-29 9:18 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii, ameryhung
Cc: bpf, linux-mm, Yafang Shao
Background
----------
Our production servers consistently configure THP to "never" due to
historical incidents caused by its behavior. Key issues include:
- Increased Memory Consumption
THP significantly raises overall memory usage, reducing available memory
for workloads.
- Latency Spikes
Random latency spikes occur due to frequent memory compaction triggered
by THP.
- Lack of Fine-Grained Control
THP tuning is globally configured, making it unsuitable for containerized
environments. When multiple workloads share a host, enabling THP without
per-workload control leads to unpredictable behavior.
Due to these issues, administrators avoid switching to the madvise or
always modes unless per-workload THP control is implemented.
To address this, we propose BPF-based THP policy for flexible adjustment.
Additionally, as David mentioned [0], this mechanism can also serve as a
policy prototyping tool (test policies via BPF before upstreaming them).
Proposed Solution
-----------------
As suggested by David [0], we introduce a new BPF interface:
/**
* @get_suggested_order: Get the suggested highest THP order for allocation
* @mm: mm_struct associated with the THP allocation
* @tva_flags: TVA flags for current context
* %TVA_IN_PF: Set when in page fault context
* Other flags: Reserved for future use
* @order: The highest order being considered for this THP allocation.
* %PUD_ORDER for PUD-mapped allocations
* %PMD_ORDER for PMD-mapped allocations
* %PMD_ORDER - 1 for mTHP allocations
*
* Return: Suggested highest THP order to use for allocation. The returned
* order will never exceed the input @order value.
*/
int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order);
This interface:
- Supports both use cases (per-workload tuning + policy prototyping).
- Can be extended with BPF helpers (e.g., for memory pressure awareness).
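For a concrete feel, a policy that suppresses PMD-mapped THP in page
fault context for one cgroup might look roughly like the sketch below
(cgrp_id is a global the program would have set from userspace; the
kfuncs are added in Patch #2, and the complete working version is the
selftest in Patch #4):

SEC("struct_ops/get_suggested_order")
int BPF_PROG(bpf_suggested_order, struct mm_struct *mm, unsigned long tva_flags, int order)
{
	struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);
	int suggested_order = order;

	if (!memcg)	/* always NULL unless CONFIG_MEMCG is enabled */
		return suggested_order;
	/* Disallow PMD-mapped THP for the target cgroup during page faults. */
	if (memcg->css.cgroup->kn->id == cgrp_id && (tva_flags & TVA_IN_PF))
		suggested_order = 0;
	bpf_put_mem_cgroup(memcg);
	return suggested_order;
}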
This is an experimental feature. To use it, you must enable
CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.
Warning:
- The interface may change
- Behavior may differ in future kernel versions
- We might remove it in the future
A simple test case is included in Patch #4.
Changes:
RFC v3->v4:
- Use a new interface get_suggested_order() (David)
- Mark it as experimental (David, Lorenzo)
- Code improvement in THP (Usama)
- Code improvement in BPF struct ops (Amery)
RFC v2->v3: https://lwn.net/Articles/1024545/
- Finer-grained tuning based on madvise or always mode (David, Lorenzo)
- Use BPF to write more advanced policy logic (David, Lorenzo)
RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows:
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log
RFC v1: https://lwn.net/Articles/1019290/
Yafang Shao (4):
mm: thp: add support for BPF based THP order selection
mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
mm: thp: add a new kfunc bpf_mm_get_task()
selftest/bpf: add selftest for BPF based THP order selection
include/linux/huge_mm.h | 13 +
include/linux/khugepaged.h | 12 +-
mm/Kconfig | 12 +
mm/Makefile | 1 +
mm/bpf_thp.c | 255 ++++++++++++++++++
mm/huge_memory.c | 9 +
mm/khugepaged.c | 18 +-
mm/memory.c | 14 +-
tools/testing/selftests/bpf/config | 2 +
.../selftests/bpf/prog_tests/thp_adjust.c | 183 +++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 69 +++++
.../bpf/progs/test_thp_adjust_failure.c | 24 ++
12 files changed, 605 insertions(+), 7 deletions(-)
create mode 100644 mm/bpf_thp.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
--
2.43.5
^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC PATCH v4 1/4] mm: thp: add support for BPF based THP order selection
2025-07-29 9:18 [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Yafang Shao
@ 2025-07-29 9:18 ` Yafang Shao
2025-07-29 15:32 ` Zi Yan
2025-07-29 9:18 ` [RFC PATCH v4 2/4] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Yafang Shao @ 2025-07-29 9:18 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii, ameryhung
Cc: bpf, linux-mm, Yafang Shao
This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
For example, workloads running in specific containers or cgroups.
- Allocation context
Whether the allocation occurs during a page fault, khugepaged, or other
paths.
- System memory pressure
(May require new BPF helpers to accurately assess memory pressure.)
Key Details:
- Only one BPF program can be attached at a time, but it can be updated
dynamically to adjust the policy.
- Supports automatic mTHP order selection and per-workload THP policies.
- Only functional when THP is set to madvise or always.
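The dynamic update mentioned above is driven from userspace with the
standard libbpf struct_ops helpers; a minimal sketch (the skeleton and
map names are illustrative):

	struct bpf_link *link;
	int err;

	/* Attach the initial policy. */
	link = bpf_map__attach_struct_ops(skel->maps.thp);
	/* ... later, atomically switch to a new policy without detaching. */
	err = bpf_link__update_map(link, new_skel->maps.thp);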
Experimental Status:
- Requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
- This feature is unstable and may evolve in future kernel versions.
Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
include/linux/huge_mm.h | 13 +++
include/linux/khugepaged.h | 12 ++-
mm/Kconfig | 12 +++
mm/Makefile | 1 +
mm/bpf_thp.c | 172 +++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 9 ++
mm/khugepaged.c | 18 +++-
mm/memory.c | 14 ++-
8 files changed, 244 insertions(+), 7 deletions(-)
create mode 100644 mm/bpf_thp.c
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..5a1527b3b6f0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -6,6 +6,8 @@
#include <linux/fs.h> /* only for vma_is_dax() */
#include <linux/kobject.h>
+#include <linux/pgtable.h>
+#include <linux/mm.h>
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
@@ -54,6 +56,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+ TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF prog is attached */
};
struct kobject;
@@ -190,6 +193,16 @@ static inline bool hugepage_global_always(void)
(1<<TRANSPARENT_HUGEPAGE_FLAG);
}
+#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
+int get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order);
+#else
+static inline int
+get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order)
+{
+ return order;
+}
+#endif
+
static inline int highest_order(unsigned long orders)
{
return fls_long(orders) - 1;
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index b8d69cfbb58b..e0242968a020 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -2,6 +2,8 @@
#ifndef _LINUX_KHUGEPAGED_H
#define _LINUX_KHUGEPAGED_H
+#include <linux/huge_mm.h>
+
extern unsigned int khugepaged_max_ptes_none __read_mostly;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
extern struct attribute_group khugepaged_attr_group;
@@ -20,7 +22,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
- if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
+ /*
+ * THP allocation policy can be dynamically modified via BPF. If a
+ * long-lived task was previously allowed to allocate THP but is no
+ * longer permitted under the new policy, we must ensure its forked
+ * child processes also inherit this restriction.
+ * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
+ */
+ if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
+ get_suggested_order(mm, 0, PMD_ORDER) == PMD_ORDER)
__khugepaged_enter(mm);
}
diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..5d05a537ecde 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -908,6 +908,18 @@ config NO_PAGE_MAPCOUNT
EXPERIMENTAL because the impact of some changes is still unclear.
+config EXPERIMENTAL_BPF_ORDER_SELECTION
+ bool "BPF-based THP order selection (EXPERIMENTAL)"
+ depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
+
+ help
+ Enable dynamic THP order selection using BPF programs. This
+ experimental feature allows custom BPF logic to determine optimal
+ transparent hugepage allocation sizes at runtime.
+
+ Warning: This feature is unstable and may change in future kernel
+ versions.
+
endif # TRANSPARENT_HUGEPAGE
# simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index 1a7a11d4933d..562525e6a28a 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
new file mode 100644
index 000000000000..10b486dd8bc4
--- /dev/null
+++ b/mm/bpf_thp.c
@@ -0,0 +1,172 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+struct bpf_thp_ops {
+ /**
+ * @get_suggested_order: Get the suggested highest THP order for allocation
+ * @mm: mm_struct associated with the THP allocation
+ * @tva_flags: TVA flags for current context
+ * %TVA_IN_PF: Set when in page fault context
+ * Other flags: Reserved for future use
+ * @order: The highest order being considered for this THP allocation.
+ * %PUD_ORDER for PUD-mapped allocations
+ * %PMD_ORDER for PMD-mapped allocations
+ * %PMD_ORDER - 1 for mTHP allocations
+ *
+ * Return: Suggested highest THP order to use for allocation. The returned
+ * order will never exceed the input @order value.
+ */
+ int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order) __rcu;
+};
+
+static struct bpf_thp_ops bpf_thp;
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+int get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order)
+{
+ int (*bpf_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order);
+ int suggested_order = order;
+
+ /* No BPF program is attached */
+ if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags))
+ return suggested_order;
+
+ rcu_read_lock();
+ bpf_suggested_order = rcu_dereference(bpf_thp.get_suggested_order);
+ if (!bpf_suggested_order)
+ goto out;
+
+ suggested_order = bpf_suggested_order(mm, tva_flags, order);
+ if (suggested_order > order)
+ suggested_order = order;
+
+out:
+ rcu_read_unlock();
+ return suggested_order;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+ .get_func_proto = bpf_thp_get_func_proto,
+ .is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+ return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ spin_lock(&thp_ops_lock);
+ if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+ &transparent_hugepage_flags)) {
+ spin_unlock(&thp_ops_lock);
+ return -EBUSY;
+ }
+ WARN_ON_ONCE(bpf_thp.get_suggested_order);
+ WRITE_ONCE(bpf_thp.get_suggested_order, ops->get_suggested_order);
+ spin_unlock(&thp_ops_lock);
+ return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+ spin_lock(&thp_ops_lock);
+ clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+ WARN_ON_ONCE(!bpf_thp.get_suggested_order);
+ rcu_replace_pointer(bpf_thp.get_suggested_order, NULL, lockdep_is_held(&thp_ops_lock));
+ spin_unlock(&thp_ops_lock);
+
+ synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+ struct bpf_thp_ops *ops = kdata;
+ struct bpf_thp_ops *old = old_kdata;
+
+ if (!ops || !old)
+ return -EINVAL;
+
+ spin_lock(&thp_ops_lock);
+ if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags))
+ goto out;
+ rcu_replace_pointer(bpf_thp.get_suggested_order, ops->get_suggested_order,
+ lockdep_is_held(&thp_ops_lock));
+
+out:
+ spin_unlock(&thp_ops_lock);
+ synchronize_rcu();
+ return 0;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+ struct bpf_thp_ops *ops = kdata;
+
+ if (!ops->get_suggested_order) {
+ pr_err("bpf_thp: required ops isn't implemented\n");
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int suggested_order(struct mm_struct *mm, unsigned long vm_flags, int order)
+{
+ return order;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+ .get_suggested_order = suggested_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+ .verifier_ops = &thp_bpf_verifier_ops,
+ .init = bpf_thp_init,
+ .init_member = bpf_thp_init_member,
+ .reg = bpf_thp_reg,
+ .unreg = bpf_thp_unreg,
+ .update = bpf_thp_update,
+ .validate = bpf_thp_validate,
+ .cfi_stubs = &__bpf_thp_ops,
+ .owner = THIS_MODULE,
+ .name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+ int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+
+ if (err)
+ pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+ return err;
+}
+late_initcall(bpf_thp_ops_init);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..e504b601205f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1328,6 +1328,15 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
return ret;
khugepaged_enter_vma(vma, vma->vm_flags);
+ /*
+ * This check must occur after khugepaged_enter_vma() because the BPF
+ * policy may permit THP collapse via khugepaged while simultaneously
+ * disallowing THP allocation during page fault handling.
+ */
+ if (get_suggested_order(vma->vm_mm, TVA_IN_PF, PMD_ORDER) != PMD_ORDER)
+ return VM_FAULT_FALLBACK;
+
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm) &&
transparent_hugepage_use_zero_page()) {
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 15203ea7d007..d0b6c1b20342 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -475,7 +475,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
hugepage_pmd_enabled()) {
if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
- PMD_ORDER))
+ PMD_ORDER) &&
+ get_suggested_order(vma->vm_mm, 0, PMD_ORDER) == PMD_ORDER)
__khugepaged_enter(vma->vm_mm);
}
}
@@ -1448,6 +1449,11 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
/* khugepaged_mm_lock actually not necessary for the below */
mm_slot_free(mm_slot_cache, mm_slot);
mmdrop(mm);
+ } else if (get_suggested_order(mm, 0, PMD_ORDER) != PMD_ORDER) {
+ hash_del(&slot->hash);
+ list_del(&slot->mm_node);
+ clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
+ mm_slot_free(mm_slot_cache, mm_slot);
}
}
@@ -2390,6 +2396,10 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
* the next mm on the list.
*/
vma = NULL;
+
+ /* If this mm is not suitable for the scan list, we should remove it. */
+ if (get_suggested_order(mm, 0, PMD_ORDER) != PMD_ORDER)
+ goto breakouterloop_mmap_lock;
if (unlikely(!mmap_read_trylock(mm)))
goto breakouterloop_mmap_lock;
@@ -2407,7 +2417,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
break;
}
if (!thp_vma_allowable_order(vma, vma->vm_flags,
- TVA_ENFORCE_SYSFS, PMD_ORDER)) {
+ TVA_ENFORCE_SYSFS, PMD_ORDER) ||
+ get_suggested_order(vma->vm_mm, 0, PMD_ORDER) != PMD_ORDER) {
skip:
progress++;
continue;
@@ -2746,6 +2757,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
if (!thp_vma_allowable_order(vma, vma->vm_flags, 0, PMD_ORDER))
return -EINVAL;
+ if (get_suggested_order(vma->vm_mm, 0, PMD_ORDER) != PMD_ORDER)
+ return -EINVAL;
+
cc = kmalloc(sizeof(*cc), GFP_KERNEL);
if (!cc)
return -ENOMEM;
diff --git a/mm/memory.c b/mm/memory.c
index b0cda5aab398..ff3e4c92a2a2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4375,6 +4375,7 @@ static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ int order, suggested_order;
unsigned long orders;
struct folio *folio;
unsigned long addr;
@@ -4382,7 +4383,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
spinlock_t *ptl;
pte_t *pte;
gfp_t gfp;
- int order;
/*
* If uffd is active for the vma we need per-page fault fidelity to
@@ -4399,13 +4399,16 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
if (!zswap_never_enabled())
goto fallback;
+ suggested_order = get_suggested_order(vma->vm_mm, TVA_IN_PF, PMD_ORDER - 1);
+ if (!suggested_order)
+ goto fallback;
entry = pte_to_swp_entry(vmf->orig_pte);
/*
* Get a list of all the (large) orders below PMD_ORDER that are enabled
* and suitable for swapping THP.
*/
orders = thp_vma_allowable_orders(vma, vma->vm_flags,
- TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+ TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(suggested_order + 1) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);
orders = thp_swap_suitable_orders(swp_offset(entry),
vmf->address, orders);
@@ -4933,12 +4936,12 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ int order, suggested_order;
unsigned long orders;
struct folio *folio;
unsigned long addr;
pte_t *pte;
gfp_t gfp;
- int order;
/*
* If uffd is active for the vma we need per-page fault fidelity to
@@ -4947,13 +4950,16 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
if (unlikely(userfaultfd_armed(vma)))
goto fallback;
+ suggested_order = get_suggested_order(vma->vm_mm, TVA_IN_PF, PMD_ORDER - 1);
+ if (!suggested_order)
+ goto fallback;
/*
* Get a list of all the (large) orders below PMD_ORDER that are enabled
* for this vma. Then filter out the orders that can't be allocated over
* the faulting address and still be fully contained in the vma.
*/
orders = thp_vma_allowable_orders(vma, vma->vm_flags,
- TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+ TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(suggested_order + 1) - 1);
orders = thp_vma_suitable_orders(vma, vmf->address, orders);
if (!orders)
--
2.43.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC PATCH v4 2/4] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
2025-07-29 9:18 [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 1/4] mm: thp: add support for " Yafang Shao
@ 2025-07-29 9:18 ` Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 3/4] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2025-07-29 9:18 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii, ameryhung
Cc: bpf, linux-mm, Yafang Shao
We will utilize this new kfunc bpf_mm_get_mem_cgroup() to retrieve the
associated mem_cgroup from the given @mm. The obtained mem_cgroup must
be released by calling bpf_put_mem_cgroup() as a paired operation.
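A minimal BPF-side usage sketch (the __ksym externs are spelled out here
for clarity; error handling trimmed):

extern struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm) __ksym;
extern void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;

SEC("struct_ops/get_suggested_order")
int BPF_PROG(order_policy, struct mm_struct *mm, unsigned long tva_flags, int order)
{
	struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);

	if (!memcg)	/* always NULL when CONFIG_MEMCG is off */
		return order;
	/* ... inspect memcg to derive a per-workload policy ... */
	bpf_put_mem_cgroup(memcg);	/* mandatory paired release */
	return order;
}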
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
mm/bpf_thp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 50 insertions(+), 1 deletion(-)
diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
index 10b486dd8bc4..040f988dbdbd 100644
--- a/mm/bpf_thp.c
+++ b/mm/bpf_thp.c
@@ -161,10 +161,59 @@ static struct bpf_struct_ops bpf_bpf_thp_ops = {
.name = "bpf_thp_ops",
};
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_mm_get_mem_cgroup - Get the memory cgroup associated with a mm_struct.
+ * @mm: The mm_struct to query
+ *
+ * The obtained mem_cgroup must be released by calling bpf_put_mem_cgroup().
+ *
+ * Return: The associated mem_cgroup on success, or NULL on failure. Note that
+ * this function depends on CONFIG_MEMCG being enabled - it will always return
+ * NULL if CONFIG_MEMCG is not configured.
+ */
+__bpf_kfunc struct mem_cgroup *bpf_mm_get_mem_cgroup(struct mm_struct *mm)
+{
+ return get_mem_cgroup_from_mm(mm);
+}
+
+/**
+ * bpf_put_mem_cgroup - Release a memory cgroup obtained from bpf_mm_get_mem_cgroup()
+ * @memcg: The memory cgroup to release
+ */
+__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
+{
+#ifdef CONFIG_MEMCG
+ if (!memcg)
+ return;
+ css_put(&memcg->css);
+#endif
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_thp_ids)
+BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+BTF_KFUNCS_END(bpf_thp_ids)
+
+static const struct btf_kfunc_id_set bpf_thp_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_thp_ids,
+};
+
static int __init bpf_thp_ops_init(void)
{
- int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+ int err;
+
+ err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &bpf_thp_set);
+ if (err) {
+ pr_err("bpf_thp: Failed to register kfunc sets (%d)\n", err);
+ return err;
+ }
+ err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
if (err)
pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
return err;
--
2.43.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC PATCH v4 3/4] mm: thp: add a new kfunc bpf_mm_get_task()
2025-07-29 9:18 [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 1/4] mm: thp: add support for " Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 2/4] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
@ 2025-07-29 9:18 ` Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 4/4] selftest/bpf: add selftest for BPF based THP order selection Yafang Shao
2025-07-29 15:07 ` [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Zi Yan
4 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2025-07-29 9:18 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii, ameryhung
Cc: bpf, linux-mm, Yafang Shao
We will utilize this new kfunc bpf_mm_get_task() to retrieve the
associated task_struct from the given @mm. The obtained task_struct must
be released by calling bpf_task_release() as a paired operation.
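A sketch of the intended pairing with the existing bpf_task_release()
kfunc (illustrative; matching on p->comm is just an example):

	struct task_struct *p = bpf_mm_get_task(mm);

	if (p) {
		/* ... e.g. inspect p->comm or p->pid to identify the workload ... */
		bpf_task_release(p);	/* mandatory paired release */
	}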
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
mm/bpf_thp.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
index 040f988dbdbd..3b10a97acc31 100644
--- a/mm/bpf_thp.c
+++ b/mm/bpf_thp.c
@@ -191,11 +191,45 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
#endif
}
+/**
+ * bpf_mm_get_task - Get the task struct associated with a mm_struct.
+ * @mm: The mm_struct to query
+ *
+ * The obtained task_struct must be released by calling bpf_task_release().
+ *
+ * Return: The associated task_struct on success, or NULL on failure. Note that
+ * this function depends on CONFIG_MEMCG being enabled - it will always return
+ * NULL if CONFIG_MEMCG is not configured.
+ */
+__bpf_kfunc struct task_struct *bpf_mm_get_task(struct mm_struct *mm)
+{
+#ifdef CONFIG_MEMCG
+ struct task_struct *task;
+
+ if (!mm)
+ return NULL;
+ rcu_read_lock();
+ task = rcu_dereference(mm->owner);
+ if (!task)
+ goto out;
+ if (!refcount_inc_not_zero(&task->rcu_users))
+ goto out;
+
+ rcu_read_unlock();
+ return task;
+
+out:
+ rcu_read_unlock();
+#endif
+ return NULL;
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(bpf_thp_ids)
BTF_ID_FLAGS(func, bpf_mm_get_mem_cgroup, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_mm_get_task, KF_TRUSTED_ARGS | KF_ACQUIRE | KF_RET_NULL)
BTF_KFUNCS_END(bpf_thp_ids)
static const struct btf_kfunc_id_set bpf_thp_set = {
--
2.43.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [RFC PATCH v4 4/4] selftest/bpf: add selftest for BPF based THP order selection
2025-07-29 9:18 [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Yafang Shao
` (2 preceding siblings ...)
2025-07-29 9:18 ` [RFC PATCH v4 3/4] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
@ 2025-07-29 9:18 ` Yafang Shao
2025-07-29 15:36 ` Zi Yan
2025-07-29 15:07 ` [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Zi Yan
4 siblings, 1 reply; 13+ messages in thread
From: Yafang Shao @ 2025-07-29 9:18 UTC (permalink / raw)
To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
npache, ryan.roberts, dev.jain, hannes, usamaarif642,
gutierrez.asier, willy, ast, daniel, andrii, ameryhung
Cc: bpf, linux-mm, Yafang Shao
This self-test verifies that PMD-mapped THP allocation is restricted in
page faults for tasks within a specific cgroup, while still permitting
THP allocation via khugepaged.
Since THP allocation depends on various factors (e.g., system memory
pressure), using the actual allocated THP size for validation is
unreliable. Instead, we check the return value of get_suggested_order(),
which indicates whether the system intends to allocate a THP, regardless of
whether the allocation ultimately succeeds.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
tools/testing/selftests/bpf/config | 2 +
.../selftests/bpf/prog_tests/thp_adjust.c | 183 ++++++++++++++++++
.../selftests/bpf/progs/test_thp_adjust.c | 69 +++++++
.../bpf/progs/test_thp_adjust_failure.c | 24 +++
4 files changed, 278 insertions(+)
create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index f74e1ea0ad3b..0364f945347d 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -118,3 +118,5 @@ CONFIG_XDP_SOCKETS=y
CONFIG_XFRM_INTERFACE=y
CONFIG_TCP_CONG_DCTCP=y
CONFIG_TCP_CONG_BBR=y
+CONFIG_TRANSPARENT_HUGEPAGE=y
+CONFIG_MEMCG=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..31d03383cbb8
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,183 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "test_thp_adjust.skel.h"
+#include "test_thp_adjust_failure.skel.h"
+
+#define LEN (16 * 1024 * 1024) /* 16MB */
+#define THP_ENABLED_PATH "/sys/kernel/mm/transparent_hugepage/enabled"
+
+static char *thp_addr;
+static char old_mode[32];
+
+int thp_mode_save(void)
+{
+ const char *start, *end;
+ char buf[128];
+ int fd, err;
+ size_t len;
+
+ fd = open(THP_ENABLED_PATH, O_RDONLY);
+ if (fd == -1)
+ return -1;
+
+ err = read(fd, buf, sizeof(buf) - 1);
+ if (err == -1)
+ goto close;
+ buf[err] = '\0';
+
+ start = strchr(buf, '[');
+ end = start ? strchr(start, ']') : NULL;
+ if (!start || !end || end <= start) {
+ err = -1;
+ goto close;
+ }
+
+ len = end - start - 1;
+ if (len >= sizeof(old_mode))
+ len = sizeof(old_mode) - 1;
+ strncpy(old_mode, start + 1, len);
+ old_mode[len] = '\0';
+
+close:
+ close(fd);
+ return err;
+}
+
+int thp_set(const char *desired_mode)
+{
+ int fd, err;
+
+ fd = open(THP_ENABLED_PATH, O_RDWR);
+ if (fd == -1)
+ return -1;
+
+ err = write(fd, desired_mode, strlen(desired_mode));
+ close(fd);
+ return err;
+}
+
+int thp_reset(void)
+{
+ int fd, err;
+
+ fd = open(THP_ENABLED_PATH, O_WRONLY);
+ if (fd == -1)
+ return -1;
+
+ err = write(fd, old_mode, strlen(old_mode));
+ close(fd);
+ return err;
+}
+
+int thp_alloc(void)
+{
+ int err, i;
+
+ thp_addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+ if (thp_addr == MAP_FAILED)
+ return -1;
+
+ err = madvise(thp_addr, LEN, MADV_HUGEPAGE);
+ if (err == -1)
+ goto unmap;
+
+ for (i = 0; i < LEN; i += 4096)
+ thp_addr[i] = 1;
+ return 0;
+
+unmap:
+ munmap(thp_addr, LEN);
+ return -1;
+}
+
+void thp_free(void)
+{
+ if (!thp_addr)
+ return;
+ munmap(thp_addr, LEN);
+}
+
+void subtest_thp_adjust(void)
+{
+ struct bpf_link *fentry_link, *ops_link;
+ struct test_thp_adjust *skel;
+ int err, cgrp_fd, cgrp_id;
+
+ err = setup_cgroup_environment();
+ if (!ASSERT_OK(err, "cgrp_env_setup"))
+ return;
+
+ cgrp_fd = create_and_get_cgroup("thp_adjust");
+ if (!ASSERT_GE(cgrp_fd, 0, "create_and_get_cgroup"))
+ goto cleanup;
+
+ err = join_cgroup("thp_adjust");
+ if (!ASSERT_OK(err, "join_cgroup"))
+ goto close_fd;
+
+ cgrp_id = get_cgroup_id("thp_adjust");
+ if (!ASSERT_GE(cgrp_id, 0, "get_cgroup_id"))
+ goto join_root;
+
+ if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+ goto join_root;
+ if (!ASSERT_GE(thp_set("madvise"), 0, "THP mode set"))
+ goto join_root;
+
+ skel = test_thp_adjust__open();
+ if (!ASSERT_OK_PTR(skel, "open"))
+ goto thp_reset;
+
+ skel->bss->cgrp_id = cgrp_id;
+ skel->bss->target_pid = getpid();
+
+ err = test_thp_adjust__load(skel);
+ if (!ASSERT_OK(err, "load"))
+ goto destroy;
+
+ fentry_link = bpf_program__attach_trace(skel->progs.thp_run);
+ if (!ASSERT_OK_PTR(fentry_link, "attach fentry"))
+ goto destroy;
+
+ ops_link = bpf_map__attach_struct_ops(skel->maps.thp);
+ if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+ goto destroy;
+
+ if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+ goto destroy;
+
+ /* After attaching struct_ops, THP will be allocated only in khugepaged. */
+ if (!ASSERT_EQ(skel->bss->pf_alloc, 0, "alloc_in_pf"))
+ goto thp_free;
+ if (!ASSERT_GT(skel->bss->pf_disallow, 0, "disallow_in_pf"))
+ goto thp_free;
+
+ if (!ASSERT_GT(skel->bss->khugepaged_alloc, 0, "alloc_in_khugepaged"))
+ goto thp_free;
+ ASSERT_EQ(skel->bss->khugepaged_disallow, 0, "disallow_in_khugepaged");
+
+thp_free:
+ thp_free();
+destroy:
+ test_thp_adjust__destroy(skel);
+thp_reset:
+ ASSERT_GE(thp_reset(), 0, "THP mode reset");
+join_root:
+ /* We must join the root cgroup before removing the created cgroup. */
+ err = join_root_cgroup();
+ ASSERT_OK(err, "join_cgroup to root");
+close_fd:
+ close(cgrp_fd);
+ remove_cgroup("thp_adjust");
+cleanup:
+ cleanup_cgroup_environment();
+}
+
+void test_thp_adjust(void)
+{
+ if (test__start_subtest("thp_adjust"))
+ subtest_thp_adjust();
+ RUN_TESTS(test_thp_adjust_failure);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..bb4aad50c7a8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,69 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define TVA_IN_PF (1 << 1)
+
+int pf_alloc, pf_disallow, khugepaged_alloc, khugepaged_disallow;
+int cgrp_id, target_pid;
+
+/* Detecting whether a task can successfully allocate THP is unreliable because
+ * it may be influenced by system memory pressure. Instead of making the result
+ * dependent on unpredictable factors, we should simply check
+ * get_suggested_order()'s return value, which is deterministic.
+ */
+SEC("fexit/get_suggested_order")
+int BPF_PROG(thp_run, struct mm_struct *mm, unsigned long tva_flags, int order, int retval)
+{
+ struct task_struct *current = bpf_get_current_task_btf();
+
+ if (current->pid != target_pid || order != 9)
+ return 0;
+
+ if (tva_flags & TVA_IN_PF) {
+ if (retval == 9)
+ pf_alloc++;
+ else if (!retval)
+ pf_disallow++;
+ } else {
+ if (retval == 9)
+ khugepaged_alloc++;
+ else if (!retval)
+ khugepaged_disallow++;
+ }
+ return 0;
+}
+
+SEC("struct_ops/get_suggested_order")
+int BPF_PROG(bpf_suggested_order, struct mm_struct *mm, unsigned long tva_flags, int order)
+{
+ struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);
+ int suggested_order = order;
+
+ /* Only works when CONFIG_MEMCG is enabled. */
+ if (!memcg)
+ return suggested_order;
+
+ if (memcg->css.cgroup->kn->id == cgrp_id) {
+ /* BPF THP allocation policy:
+ * - Disallow PMD allocation in page fault context
+ */
+ if (tva_flags & TVA_IN_PF && order == 9) {
+ suggested_order = 0;
+ goto out;
+ }
+ }
+
+out:
+ bpf_put_mem_cgroup(memcg);
+ return suggested_order;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp = {
+ .get_suggested_order = (void *)bpf_suggested_order,
+};
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
new file mode 100644
index 000000000000..b080aead9b87
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
@@ -0,0 +1,24 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/get_suggested_order")
+__failure __msg("Unreleased reference")
+int BPF_PROG(unreleased_task, struct mm_struct *mm, unsigned long tva_flags, int order)
+{
+ struct task_struct *p = bpf_mm_get_task(mm);
+
+ /* The task should be released with bpf_task_release() */
+ return p ? 9 : 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp = {
+ .get_suggested_order = (void *)unreleased_task,
+};
--
2.43.5
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection
2025-07-29 9:18 [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Yafang Shao
` (3 preceding siblings ...)
2025-07-29 9:18 ` [RFC PATCH v4 4/4] selftest/bpf: add selftest for BPF based THP order selection Yafang Shao
@ 2025-07-29 15:07 ` Zi Yan
2025-07-30 2:31 ` Yafang Shao
4 siblings, 1 reply; 13+ messages in thread
From: Zi Yan @ 2025-07-29 15:07 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, ameryhung, bpf, linux-mm
On 29 Jul 2025, at 5:18, Yafang Shao wrote:
> Background
> ----------
>
> Our production servers consistently configure THP to "never" due to
> historical incidents caused by its behavior. Key issues include:
> - Increased Memory Consumption
> THP significantly raises overall memory usage, reducing available memory
> for workloads.
>
> - Latency Spikes
> Random latency spikes occur due to frequent memory compaction triggered
> by THP.
>
> - Lack of Fine-Grained Control
> THP tuning is globally configured, making it unsuitable for containerized
> environments. When multiple workloads share a host, enabling THP without
> per-workload control leads to unpredictable behavior.
>
> Due to these issues, administrators avoid switching to the madvise or
> always modes unless per-workload THP control is implemented.
>
> To address this, we propose BPF-based THP policy for flexible adjustment.
> Additionally, as David mentioned [0], this mechanism can also serve as a
The link to [0] is missing. :)
> policy prototyping tool (test policies via BPF before upstreaming them).
>
> Proposed Solution
> -----------------
>
> As suggested by David [0], we introduce a new BPF interface:
>
> /**
> * @get_suggested_order: Get the suggested highest THP order for allocation
> * @mm: mm_struct associated with the THP allocation
> * @tva_flags: TVA flags for current context
> * %TVA_IN_PF: Set when in page fault context
> * Other flags: Reserved for future use
> * @order: The highest order being considered for this THP allocation.
> * %PUD_ORDER for PUD-mapped allocations
There is no PUD THP yet and the highest THP order is PMD_ORDER. It is better
to remove the line above to avoid confusion.
> * %PMD_ORDER for PMD-mapped allocations
> * %PMD_ORDER - 1 for mTHP allocations
> *
> * Return: Suggested highest THP order to use for allocation. The returned
> * order will never exceed the input @order value.
> */
> int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order);
>
> This interface:
> - Supports both use cases (per-workload tuning + policy prototyping).
> - Can be extended with BPF helpers (e.g., for memory pressure awareness).
IIRC, your initial RFC works at VMA level, but this patch targets mm level.
Is mm sufficient for your use case? Are you planning to extend the
BPF interface to VMA in the future? Just curious.
>
> This is an experimental feature. To use it, you must enable
> CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION.
>
> Warning:
> - The interface may change
> - Behavior may differ in future kernel versions
> - We might remove it in the future
>
> A simple test case is included in Patch #4.
>
> Changes:
> RFC v3->v4:
> - Use a new interface get_suggested_order() (David)
> - Mark it as experimental (David, Lorenzo)
> - Code improvement in THP (Usama)
> - Code improvement in BPF struct ops (Amery)
>
> RFC v2->v3: https://lwn.net/Articles/1024545/
> - Finer-grained tuning based on madvise or always mode (David, Lorenzo)
> - Use BPF to write more advanced policy logic (David, Lorenzo)
>
> RFC v1->v2: https://lwn.net/Articles/1021783/
> The main changes are as follows:
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (4):
> mm: thp: add support for BPF based THP order selection
> mm: thp: add a new kfunc bpf_mm_get_mem_cgroup()
> mm: thp: add a new kfunc bpf_mm_get_task()
> selftest/bpf: add selftest for BPF based THP order selection
>
> include/linux/huge_mm.h | 13 +
> include/linux/khugepaged.h | 12 +-
> mm/Kconfig | 12 +
> mm/Makefile | 1 +
> mm/bpf_thp.c | 255 ++++++++++++++++++
> mm/huge_memory.c | 9 +
> mm/khugepaged.c | 18 +-
> mm/memory.c | 14 +-
> tools/testing/selftests/bpf/config | 2 +
> .../selftests/bpf/prog_tests/thp_adjust.c | 183 +++++++++++++
> .../selftests/bpf/progs/test_thp_adjust.c | 69 +++++
> .../bpf/progs/test_thp_adjust_failure.c | 24 ++
> 12 files changed, 605 insertions(+), 7 deletions(-)
> create mode 100644 mm/bpf_thp.c
> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
>
> --
> 2.43.5
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v4 1/4] mm: thp: add support for BPF based THP order selection
2025-07-29 9:18 ` [RFC PATCH v4 1/4] mm: thp: add support for " Yafang Shao
@ 2025-07-29 15:32 ` Zi Yan
2025-07-30 2:36 ` Yafang Shao
0 siblings, 1 reply; 13+ messages in thread
From: Zi Yan @ 2025-07-29 15:32 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, ameryhung, bpf, linux-mm
On 29 Jul 2025, at 5:18, Yafang Shao wrote:
> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> programs to influence THP order selection based on factors such as:
> - Workload identity
> For example, workloads running in specific containers or cgroups.
> - Allocation context
> Whether the allocation occurs during a page fault, khugepaged, or other
> paths.
> - System memory pressure
> (May require new BPF helpers to accurately assess memory pressure.)
>
> Key Details:
> - Only one BPF program can be attached at a time, but it can be updated
> dynamically to adjust the policy.
> - Supports automatic mTHP order selection and per-workload THP policies.
> - Only functional when THP is set to madvise or always.
>
> Experimental Status:
> - Requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> - This feature is unstable and may evolve in future kernel versions.
>
> Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> include/linux/huge_mm.h | 13 +++
> include/linux/khugepaged.h | 12 ++-
> mm/Kconfig | 12 +++
> mm/Makefile | 1 +
> mm/bpf_thp.c | 172 +++++++++++++++++++++++++++++++++++++
> mm/huge_memory.c | 9 ++
> mm/khugepaged.c | 18 +++-
> mm/memory.c | 14 ++-
> 8 files changed, 244 insertions(+), 7 deletions(-)
> create mode 100644 mm/bpf_thp.c
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..5a1527b3b6f0 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -6,6 +6,8 @@
>
> #include <linux/fs.h> /* only for vma_is_dax() */
> #include <linux/kobject.h>
> +#include <linux/pgtable.h>
> +#include <linux/mm.h>
>
> vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> @@ -54,6 +56,7 @@ enum transparent_hugepage_flag {
> TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> + TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF prog is attached */
> };
>
> struct kobject;
> @@ -190,6 +193,16 @@ static inline bool hugepage_global_always(void)
> (1<<TRANSPARENT_HUGEPAGE_FLAG);
> }
>
> +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> +int get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order);
> +#else
> +static inline int
> +get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order)
> +{
> + return order;
> +}
> +#endif
> +
> static inline int highest_order(unsigned long orders)
> {
> return fls_long(orders) - 1;
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index b8d69cfbb58b..e0242968a020 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -2,6 +2,8 @@
> #ifndef _LINUX_KHUGEPAGED_H
> #define _LINUX_KHUGEPAGED_H
>
> +#include <linux/huge_mm.h>
> +
> extern unsigned int khugepaged_max_ptes_none __read_mostly;
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> extern struct attribute_group khugepaged_attr_group;
> @@ -20,7 +22,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>
> static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> {
> - if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> + /*
> + * THP allocation policy can be dynamically modified via BPF. If a
> + * long-lived task was previously allowed to allocate THP but is no
> + * longer permitted under the new policy, we must ensure its forked
> + * child processes also inherit this restriction.
The comment is probably better to be:
THP allocation policy can be dynamically modified via BPF. Even if a task
was allowed to allocate THPs, BPF can decide whether its forked child
can allocate THPs.
The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
Because the code here just wants to change a forked child’s mm flag. It has
nothing to do with its parent THP policy.
> + * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> + */
> + if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
> + get_suggested_order(mm, 0, PMD_ORDER) == PMD_ORDER)
Will it work for mTHPs? Nico is adding mTHP support for khugepaged[1].
What if a BPF program wants khugepaged to work on some mTHP orders.
Maybe get_suggested_order() should accept a bitmask of all allowed
orders and return a bitmask as well; khugepaged would be skipped only
when the returned bitmask is 0.
[1] https://lore.kernel.org/linux-mm/20250714003207.113275-1-npache@redhat.com/
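A rough sketch of that bitmask-based alternative (illustrative only,
not part of this series):

	/* @orders: bitmask of candidate orders; return the allowed subset */
	unsigned long (*get_suggested_orders)(struct mm_struct *mm,
					      unsigned long tva_flags,
					      unsigned long orders);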
> __khugepaged_enter(mm);
> }
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..5d05a537ecde 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -908,6 +908,18 @@ config NO_PAGE_MAPCOUNT
>
> EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config EXPERIMENTAL_BPF_ORDER_SELECTION
> + bool "BPF-based THP order selection (EXPERIMENTAL)"
> + depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> +
> + help
> + Enable dynamic THP order selection using BPF programs. This
> + experimental feature allows custom BPF logic to determine optimal
> + transparent hugepage allocation sizes at runtime.
> +
> + Warning: This feature is unstable and may change in future kernel
> + versions.
> +
> endif # TRANSPARENT_HUGEPAGE
>
> # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..562525e6a28a 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> obj-$(CONFIG_NUMA) += memory-tiers.o
> obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> new file mode 100644
> index 000000000000..10b486dd8bc4
> --- /dev/null
> +++ b/mm/bpf_thp.c
> @@ -0,0 +1,172 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +struct bpf_thp_ops {
> + /**
> + * @get_suggested_order: Get the suggested highest THP order for allocation
> + * @mm: mm_struct associated with the THP allocation
> + * @tva_flags: TVA flags for current context
> + * %TVA_IN_PF: Set when in page fault context
> + * Other flags: Reserved for future use
> + * @order: The highest order being considered for this THP allocation.
> + * %PUD_ORDER for PUD-mapped allocations
Like I mentioned in the cover letter, PMD_ORDER is the highest order
mm currently supports. I wonder if it is better to be a bitmask of orders
to better support mTHP.
> + * %PMD_ORDER for PMD-mapped allocations
> + * %PMD_ORDER - 1 for mTHP allocations
> + *
> + * Return: Suggested highest THP order to use for allocation. The returned
> + * order will never exceed the input @order value.
> + */
> + int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order) __rcu;
> +};
> +
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v4 4/4] selftest/bpf: add selftest for BPF based THP order selection
2025-07-29 9:18 ` [RFC PATCH v4 4/4] selftest/bpf: add selftest for BPF based THP order selection Yafang Shao
@ 2025-07-29 15:36 ` Zi Yan
2025-07-30 2:38 ` Yafang Shao
0 siblings, 1 reply; 13+ messages in thread
From: Zi Yan @ 2025-07-29 15:36 UTC (permalink / raw)
To: Yafang Shao
Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, ameryhung, bpf, linux-mm
On 29 Jul 2025, at 5:18, Yafang Shao wrote:
> This self-test verifies that PMD-mapped THP allocation is restricted in
> page faults for tasks within a specific cgroup, while still permitting
> THP allocation via khugepaged.
>
> Since THP allocation depends on various factors (e.g., system memory
> pressure), using the actual allocated THP size for validation is
> unreliable. Instead, we check the return value of get_suggested_order(),
> which indicates whether the system intends to allocate a THP, regardless of
> whether the allocation ultimately succeeds.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
> tools/testing/selftests/bpf/config | 2 +
> .../selftests/bpf/prog_tests/thp_adjust.c | 183 ++++++++++++++++++
> .../selftests/bpf/progs/test_thp_adjust.c | 69 +++++++
> .../bpf/progs/test_thp_adjust_failure.c | 24 +++
> 4 files changed, 278 insertions(+)
> create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
>
The program below will only work on architectures with a 4KB base page,
where the PMD order is 9. It is better to read the base page size and
PMD page size from the system.
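For example, both sizes can be queried at run time; a sketch with error
handling omitted:

	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long pmd_size;
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

	fscanf(f, "%lu", &pmd_size);
	fclose(f);
	/* pmd_size / page_size is a power of two, so ctz yields the PMD order */
	int pmd_order = __builtin_ctzl(pmd_size / page_size);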
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index f74e1ea0ad3b..0364f945347d 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -118,3 +118,5 @@ CONFIG_XDP_SOCKETS=y
> CONFIG_XFRM_INTERFACE=y
> CONFIG_TCP_CONG_DCTCP=y
> CONFIG_TCP_CONG_BBR=y
> +CONFIG_TRANSPARENT_HUGEPAGE=y
> +CONFIG_MEMCG=y
> diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> new file mode 100644
> index 000000000000..31d03383cbb8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> @@ -0,0 +1,183 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <sys/mman.h>
> +#include <test_progs.h>
> +#include "cgroup_helpers.h"
> +#include "test_thp_adjust.skel.h"
> +#include "test_thp_adjust_failure.skel.h"
> +
> +#define LEN (16 * 1024 * 1024) /* 16MB */
> +#define THP_ENABLED_PATH "/sys/kernel/mm/transparent_hugepage/enabled"
> +
> +static char *thp_addr;
> +static char old_mode[32];
> +
> +int thp_mode_save(void)
> +{
> + const char *start, *end;
> + char buf[128];
> + int fd, err;
> + size_t len;
> +
> + fd = open(THP_ENABLED_PATH, O_RDONLY);
> + if (fd == -1)
> + return -1;
> +
> + err = read(fd, buf, sizeof(buf) - 1);
> + if (err == -1)
> + goto close;
> + buf[err] = '\0';
> +
> + start = strchr(buf, '[');
> + end = start ? strchr(start, ']') : NULL;
> + if (!start || !end || end <= start) {
> + err = -1;
> + goto close;
> + }
> +
> + len = end - start - 1;
> + if (len >= sizeof(old_mode))
> + len = sizeof(old_mode) - 1;
> + strncpy(old_mode, start + 1, len);
> + old_mode[len] = '\0';
> +
> +close:
> + close(fd);
> + return err;
> +}
> +
> +int thp_set(const char *desired_mode)
> +{
> + int fd, err;
> +
> + fd = open(THP_ENABLED_PATH, O_RDWR);
> + if (fd == -1)
> + return -1;
> +
> + err = write(fd, desired_mode, strlen(desired_mode));
> + close(fd);
> + return err;
> +}
> +
> +int thp_reset(void)
> +{
> + int fd, err;
> +
> + fd = open(THP_ENABLED_PATH, O_WRONLY);
> + if (fd == -1)
> + return -1;
> +
> + err = write(fd, old_mode, strlen(old_mode));
> + close(fd);
> + return err;
> +}
> +
> +int thp_alloc(void)
> +{
> + int err, i;
> +
> + thp_addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
> + if (thp_addr == MAP_FAILED)
> + return -1;
> +
> + err = madvise(thp_addr, LEN, MADV_HUGEPAGE);
> + if (err == -1)
> + goto unmap;
> +
> + for (i = 0; i < LEN; i += 4096)
> + thp_addr[i] = 1;
> + return 0;
> +
> +unmap:
> + munmap(thp_addr, LEN);
> + return -1;
> +}
> +
> +void thp_free(void)
> +{
> + if (!thp_addr)
> + return;
> + munmap(thp_addr, LEN);
> +}
> +
> +void subtest_thp_adjust(void)
> +{
> + struct bpf_link *fentry_link, *ops_link;
> + struct test_thp_adjust *skel;
> + int err, cgrp_fd, cgrp_id;
> +
> + err = setup_cgroup_environment();
> + if (!ASSERT_OK(err, "cgrp_env_setup"))
> + return;
> +
> + cgrp_fd = create_and_get_cgroup("thp_adjust");
> + if (!ASSERT_GE(cgrp_fd, 0, "create_and_get_cgroup"))
> + goto cleanup;
> +
> + err = join_cgroup("thp_adjust");
> + if (!ASSERT_OK(err, "join_cgroup"))
> + goto close_fd;
> +
> + cgrp_id = get_cgroup_id("thp_adjust");
> + if (!ASSERT_GE(cgrp_id, 0, "get_cgroup_id"))
> + goto join_root;
> +
> + if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
> + goto join_root;
> + if (!ASSERT_GE(thp_set("madvise"), 0, "THP mode set"))
> + goto join_root;
> +
> + skel = test_thp_adjust__open();
> + if (!ASSERT_OK_PTR(skel, "open"))
> + goto thp_reset;
> +
> + skel->bss->cgrp_id = cgrp_id;
> + skel->bss->target_pid = getpid();
> +
> + err = test_thp_adjust__load(skel);
> + if (!ASSERT_OK(err, "load"))
> + goto destroy;
> +
> + fentry_link = bpf_program__attach_trace(skel->progs.thp_run);
> + if (!ASSERT_OK_PTR(fentry_link, "attach fentry"))
> + goto destroy;
> +
> + ops_link = bpf_map__attach_struct_ops(skel->maps.thp);
> + if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
> + goto destroy;
> +
> + if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
> + goto destroy;
> +
> + /* After attaching struct_ops, THP will be allocated only in khugepaged. */
> + if (!ASSERT_EQ(skel->bss->pf_alloc, 0, "alloc_in_pf"))
> + goto thp_free;
> + if (!ASSERT_GT(skel->bss->pf_disallow, 0, "disallow_in_pf"))
> + goto thp_free;
> +
> + if (!ASSERT_GT(skel->bss->khugepaged_alloc, 0, "alloc_in_khugepaged"))
> + goto thp_free;
> + ASSERT_EQ(skel->bss->khugepaged_disallow, 0, "disallow_in_khugepaged");
> +
> +thp_free:
> + thp_free();
> +destroy:
> + test_thp_adjust__destroy(skel);
> +thp_reset:
> + ASSERT_GE(thp_reset(), 0, "THP mode reset");
> +join_root:
> + /* We must join the root cgroup before removing the created cgroup. */
> + err = join_root_cgroup();
> + ASSERT_OK(err, "join_cgroup to root");
> +close_fd:
> + close(cgrp_fd);
> + remove_cgroup("thp_adjust");
> +cleanup:
> + cleanup_cgroup_environment();
> +}
> +
> +void test_thp_adjust(void)
> +{
> + if (test__start_subtest("thp_adjust"))
> + subtest_thp_adjust();
> + RUN_TESTS(test_thp_adjust_failure);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
> new file mode 100644
> index 000000000000..bb4aad50c7a8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
> @@ -0,0 +1,69 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define TVA_IN_PF (1 << 1)
> +
> +int pf_alloc, pf_disallow, khugepaged_alloc, khugepaged_disallow;
> +int cgrp_id, target_pid;
> +
> +/* Detecting whether a task can successfully allocate THP is unreliable because
> + * it may be influenced by system memory pressure. Instead of making the result
> + * dependent on unpredictable factors, we should simply check
> + * get_suggested_order()'s return value, which is deterministic.
> + */
> +SEC("fexit/get_suggested_order")
> +int BPF_PROG(thp_run, struct mm_struct *mm, unsigned long tva_flags, int order, int retval)
> +{
> + struct task_struct *current = bpf_get_current_task_btf();
> +
> + if (current->pid != target_pid || order != 9)
> + return 0;
> +
> + if (tva_flags & TVA_IN_PF) {
> + if (retval == 9)
> + pf_alloc++;
> + else if (!retval)
> + pf_disallow++;
> + } else {
> + if (retval == 9)
> + khugepaged_alloc++;
> + else if (!retval)
> + khugepaged_disallow++;
> + }
> + return 0;
> +}
> +
> +SEC("struct_ops/get_suggested_order")
> +int BPF_PROG(bpf_suggested_order, struct mm_struct *mm, unsigned long tva_flags, int order)
> +{
> + struct mem_cgroup *memcg = bpf_mm_get_mem_cgroup(mm);
> + int suggested_order = order;
> +
> + /* Only works when CONFIG_MEMCG is enabled. */
> + if (!memcg)
> + return suggested_order;
> +
> + if (memcg->css.cgroup->kn->id == cgrp_id) {
> + /* BPF THP allocation policy:
> + * - Disallow PMD allocation in page fault context
> + */
> + if (tva_flags & TVA_IN_PF && order == 9) {
> + suggested_order = 0;
> + goto out;
> + }
> + }
> +
> +out:
> + bpf_put_mem_cgroup(memcg);
> + return suggested_order;
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_thp_ops thp = {
> + .get_suggested_order = (void *)bpf_suggested_order,
> +};
> diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
> new file mode 100644
> index 000000000000..b080aead9b87
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
> @@ -0,0 +1,24 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +#include "bpf_misc.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("struct_ops/get_suggested_order")
> +__failure __msg("Unreleased reference")
> +int BPF_PROG(unreleased_task, struct mm_struct *mm, unsigned long tva_flags, int order)
> +{
> + struct task_struct *p = bpf_mm_get_task(mm);
> +
> + /* The task should be released with bpf_task_release() */
> + return p ? 9 : 0;
> +}
> +
> +SEC(".struct_ops.link")
> +struct bpf_thp_ops thp = {
> + .get_suggested_order = (void *)unreleased_task,
> +};
> --
> 2.43.5
Best Regards,
Yan, Zi
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection
2025-07-29 15:07 ` [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Zi Yan
@ 2025-07-30 2:31 ` Yafang Shao
2025-07-30 9:58 ` David Hildenbrand
0 siblings, 1 reply; 13+ messages in thread
From: Yafang Shao @ 2025-07-30 2:31 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, ameryhung, bpf, linux-mm
On Tue, Jul 29, 2025 at 11:08 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 29 Jul 2025, at 5:18, Yafang Shao wrote:
>
> > Background
> > ----------
> >
> > Our production servers consistently configure THP to "never" due to
> > historical incidents caused by its behavior. Key issues include:
> > - Increased Memory Consumption
> > THP significantly raises overall memory usage, reducing available memory
> > for workloads.
> >
> > - Latency Spikes
> > Random latency spikes occur due to frequent memory compaction triggered
> > by THP.
> >
> > - Lack of Fine-Grained Control
> > THP tuning is globally configured, making it unsuitable for containerized
> > environments. When multiple workloads share a host, enabling THP without
> > per-workload control leads to unpredictable behavior.
> >
> > Due to these issues, administrators avoid switching to madvise or always
> > modes—unless per-workload THP control is implemented.
> >
> > To address this, we propose BPF-based THP policy for flexible adjustment.
> > Additionally, as David mentioned [0], this mechanism can also serve as a
>
> The link to [0] is missing. :)
I forgot to add it:
https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/
>
> > policy prototyping tool (test policies via BPF before upstreaming them).
> >
> > Proposed Solution
> > -----------------
> >
> > As suggested by David [0], we introduce a new BPF interface:
> >
> > /**
> > * @get_suggested_order: Get the suggested highest THP order for allocation
> > * @mm: mm_struct associated with the THP allocation
> > * @tva_flags: TVA flags for current context
> > * %TVA_IN_PF: Set when in page fault context
> > * Other flags: Reserved for future use
> > * @order: The highest order being considered for this THP allocation.
> > * %PUD_ORDER for PUD-mapped allocations
>
> There is no PUD THP yet and the highest THP order is PMD_ORDER. It is better
> to remove the line above to avoid confusion.
Thanks for catching that. I’ll remove it.
>
> > * %PMD_ORDER for PMD-mapped allocations
> > * %PMD_ORDER - 1 for mTHP allocations
> > *
> > * Return: Suggested highest THP order to use for allocation. The returned
> > * order will never exceed the input @order value.
> > */
> > int (*get_suggested_order)(struct mm_struct *mm, unsigned long tva_flags, int order);
> >
> > This interface:
> > - Supports both use cases (per-workload tuning + policy prototyping).
> > - Can be extended with BPF helpers (e.g., for memory pressure awareness).
>
> IIRC, your initial RFC worked at the VMA level, but this patch targets the mm level.
> Is mm sufficient for your use case?
Yes, mm is sufficient for our use cases.
We've already deployed a variant of this patchset in our production
environment, and it has been performing well under our workloads.
> Are you planning to extend the
> BPF interface to VMA in the future? Just curious.
Our use cases don’t currently require the VMA.
We can add it later if a clear need arises.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v4 1/4] mm: thp: add support for BPF based THP order selection
2025-07-29 15:32 ` Zi Yan
@ 2025-07-30 2:36 ` Yafang Shao
0 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2025-07-30 2:36 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, ameryhung, bpf, linux-mm
On Tue, Jul 29, 2025 at 11:32 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 29 Jul 2025, at 5:18, Yafang Shao wrote:
>
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook get_suggested_order() [0], allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> > For example, workloads running in specific containers or cgroups.
> > - Allocation context
> > Whether the allocation occurs during a page fault, khugepaged, or other
> > paths.
> > - System memory pressure
> > (May require new BPF helpers to accurately assess memory pressure.)
> >
> > Key Details:
> > - Only one BPF program can be attached at a time, but it can be updated
> > dynamically to adjust the policy.
> > - Supports automatic mTHP order selection and per-workload THP policies.
> > - Only functional when THP is set to madvise or always.
> >
> > Experimental Status:
> > - Requires CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION to enable. [1]
> > - This feature is unstable and may evolve in future kernel versions.
> >
> > Link: https://lwn.net/ml/all/9bc57721-5287-416c-aa30-46932d605f63@redhat.com/ [0]
> > Link: https://lwn.net/ml/all/dda67ea5-2943-497c-a8e5-d81f0733047d@lucifer.local/ [1]
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> > include/linux/huge_mm.h | 13 +++
> > include/linux/khugepaged.h | 12 ++-
> > mm/Kconfig | 12 +++
> > mm/Makefile | 1 +
> > mm/bpf_thp.c | 172 +++++++++++++++++++++++++++++++++++++
> > mm/huge_memory.c | 9 ++
> > mm/khugepaged.c | 18 +++-
> > mm/memory.c | 14 ++-
> > 8 files changed, 244 insertions(+), 7 deletions(-)
> > create mode 100644 mm/bpf_thp.c
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 2f190c90192d..5a1527b3b6f0 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -6,6 +6,8 @@
> >
> > #include <linux/fs.h> /* only for vma_is_dax() */
> > #include <linux/kobject.h>
> > +#include <linux/pgtable.h>
> > +#include <linux/mm.h>
> >
> > vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
> > int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > @@ -54,6 +56,7 @@ enum transparent_hugepage_flag {
> > TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> > TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > + TRANSPARENT_HUGEPAGE_BPF_ATTACHED, /* BPF prog is attached */
> > };
> >
> > struct kobject;
> > @@ -190,6 +193,16 @@ static inline bool hugepage_global_always(void)
> > (1<<TRANSPARENT_HUGEPAGE_FLAG);
> > }
> >
> > +#ifdef CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION
> > +int get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order);
> > +#else
> > +static inline int
> > +get_suggested_order(struct mm_struct *mm, unsigned long tva_flags, int order)
> > +{
> > + return order;
> > +}
> > +#endif
> > +
> > static inline int highest_order(unsigned long orders)
> > {
> > return fls_long(orders) - 1;
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index b8d69cfbb58b..e0242968a020 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -2,6 +2,8 @@
> > #ifndef _LINUX_KHUGEPAGED_H
> > #define _LINUX_KHUGEPAGED_H
> >
> > +#include <linux/huge_mm.h>
> > +
> > extern unsigned int khugepaged_max_ptes_none __read_mostly;
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > extern struct attribute_group khugepaged_attr_group;
> > @@ -20,7 +22,15 @@ extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >
> > static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > {
> > - if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
> > + /*
> > + * THP allocation policy can be dynamically modified via BPF. If a
> > + * long-lived task was previously allowed to allocate THP but is no
> > + * longer permitted under the new policy, we must ensure its forked
> > + * child processes also inherit this restriction.
>
> The comment is probably better to be:
>
> THP allocation policy can be dynamically modified via BPF. Even if a task
> was allowed to allocate THPs, BPF can decide whether its forked child
> can allocate THPs.
>
> The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
>
> The code here just wants to set the forked child's mm flag; it has
> nothing to do with the parent's THP policy.
Thanks for the improvement. I will change it.
>
> > + * The MMF_VM_HUGEPAGE flag will be cleared by khugepaged.
> > + */
> > + if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags) &&
> > + get_suggested_order(mm, 0, PMD_ORDER) == PMD_ORDER)
>
> Will it work for mTHPs? Nico is adding mTHP support for khugepaged[1].
> What if a BPF program wants khugepaged to work on some mTHP orders.
>
> Maybe get_suggested_order() should accept a bitmask of all allowed
> orders and return a bitmask as well; khugepaged would then be entered
> only if the returned bitmask is non-zero.
>
> [1] https://lore.kernel.org/linux-mm/20250714003207.113275-1-npache@redhat.com/
Thanks for the information.
It seems extending this to use a bitmask would better accommodate
future changes.
I’ll give it some thought.
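To make that concrete, a bitmask variant might look roughly like the
sketch below (hypothetical signature and caller, not what this series
implements; it assumes the candidate orders fit in an unsigned long,
as with the existing THP_ORDERS_ALL_ANON mask):

	/*
	 * @orders: bitmask of candidate THP orders for this allocation
	 *
	 * Returns the subset of @orders the policy permits; 0 means no
	 * THP/mTHP order is allowed.
	 */
	unsigned long (*get_suggested_orders)(struct mm_struct *mm,
					      unsigned long tva_flags,
					      unsigned long orders);

A caller like khugepaged could then do:

	unsigned long allowed = get_suggested_orders(mm, 0, THP_ORDERS_ALL_ANON);

	if (!allowed)
		return;		/* policy disallows every order */
	order = highest_order(allowed);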
>
> > __khugepaged_enter(mm);
> > }
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 781be3240e21..5d05a537ecde 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -908,6 +908,18 @@ config NO_PAGE_MAPCOUNT
> >
> > EXPERIMENTAL because the impact of some changes is still unclear.
> >
> > +config EXPERIMENTAL_BPF_ORDER_SELECTION
> > + bool "BPF-based THP order selection (EXPERIMENTAL)"
> > + depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > +
> > + help
> > + Enable dynamic THP order selection using BPF programs. This
> > + experimental feature allows custom BPF logic to determine optimal
> > + transparent hugepage allocation sizes at runtime.
> > +
> > + Warning: This feature is unstable and may change in future kernel
> > + versions.
> > +
> > endif # TRANSPARENT_HUGEPAGE
> >
> > # simple helper to make the code a bit easier to read
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 1a7a11d4933d..562525e6a28a 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> > obj-$(CONFIG_NUMA) += memory-tiers.o
> > obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +obj-$(CONFIG_EXPERIMENTAL_BPF_ORDER_SELECTION) += bpf_thp.o
> > obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> > obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > new file mode 100644
> > index 000000000000..10b486dd8bc4
> > --- /dev/null
> > +++ b/mm/bpf_thp.c
> > @@ -0,0 +1,172 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +struct bpf_thp_ops {
> > + /**
> > + * @get_suggested_order: Get the suggested highest THP order for allocation
> > + * @mm: mm_struct associated with the THP allocation
> > + * @tva_flags: TVA flags for current context
> > + * %TVA_IN_PF: Set when in page fault context
> > + * Other flags: Reserved for future use
> > + * @order: The highest order being considered for this THP allocation.
> > + * %PUD_ORDER for PUD-mapped allocations
>
> Like I mentioned in the cover letter, PMD_ORDER is the highest order
> mm currently supports. I wonder if it would be better to use a bitmask
> of orders to better support mTHP.
I’ll look into it.
Regards
Yafang
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v4 4/4] selftest/bpf: add selftest for BPF based THP order seletection
2025-07-29 15:36 ` Zi Yan
@ 2025-07-30 2:38 ` Yafang Shao
0 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2025-07-30 2:38 UTC (permalink / raw)
To: Zi Yan
Cc: akpm, david, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, ameryhung, bpf, linux-mm
On Tue, Jul 29, 2025 at 11:36 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 29 Jul 2025, at 5:18, Yafang Shao wrote:
>
> > This self-test verifies that PMD-mapped THP allocation is restricted in
> > page faults for tasks within a specific cgroup, while still permitting
> > THP allocation via khugepaged.
> >
> > Since THP allocation depends on various factors (e.g., system memory
> > pressure), using the actual allocated THP size for validation is
> > unreliable. Instead, we check the return value of get_suggested_order(),
> > which indicates whether the system intends to allocate a THP, regardless of
> > whether the allocation ultimately succeeds.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> > tools/testing/selftests/bpf/config | 2 +
> > .../selftests/bpf/prog_tests/thp_adjust.c | 183 ++++++++++++++++++
> > .../selftests/bpf/progs/test_thp_adjust.c | 69 +++++++
> > .../bpf/progs/test_thp_adjust_failure.c | 24 +++
> > 4 files changed, 278 insertions(+)
> > create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
> > create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
> > create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_failure.c
> >
>
> The program below will only work on architectures with a 4KB base page
> size and a PMD order of 9. It is better to read the base page size and
> PMD page size from the system.
Thanks for the suggestion. I’ll change it.
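Something along these lines, reading both sizes from the system instead
of hardcoding order 9 (a userspace sketch; it assumes the standard
/sys/kernel/mm/transparent_hugepage/hpage_pmd_size file is present):

	#include <stdio.h>
	#include <unistd.h>

	static int read_pmd_order(void)
	{
		unsigned long pmd_size, page_size = sysconf(_SC_PAGESIZE);
		FILE *f;

		f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");
		if (!f)
			return -1;
		if (fscanf(f, "%lu", &pmd_size) != 1) {
			fclose(f);
			return -1;
		}
		fclose(f);
		/* Both are powers of two, so ctz gives log2 of the ratio. */
		return __builtin_ctzl(pmd_size / page_size);	/* 9 on x86-64/4K */
	}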
--
Regards
Yafang
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection
2025-07-30 2:31 ` Yafang Shao
@ 2025-07-30 9:58 ` David Hildenbrand
2025-07-31 2:07 ` Yafang Shao
0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-07-30 9:58 UTC (permalink / raw)
To: Yafang Shao, Zi Yan
Cc: akpm, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, ameryhung, bpf, linux-mm
>
> Yes, mm is sufficient for our use cases.
> We've already deployed a variant of this patchset in our production
> environment, and it has been performing well under our workloads.
>
>> Are you planning to extend the
>> BPF interface to VMA in the future? Just curious.
>
> Our use cases don’t currently require the VMA.
> We can add it later if a clear need arises.
We should start immediately with a VMA-based approach as discussed
previously.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection
2025-07-30 9:58 ` David Hildenbrand
@ 2025-07-31 2:07 ` Yafang Shao
0 siblings, 0 replies; 13+ messages in thread
From: Yafang Shao @ 2025-07-31 2:07 UTC (permalink / raw)
To: David Hildenbrand
Cc: Zi Yan, akpm, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
willy, ast, daniel, andrii, ameryhung, bpf, linux-mm
On Wed, Jul 30, 2025 at 5:58 PM David Hildenbrand <david@redhat.com> wrote:
>
>
> >
> > Yes, mm is sufficient for our use cases.
> > We've already deployed a variant of this patchset in our production
> > environment, and it has been performing well under our workloads.
> >
> >> Are you planning to extend the
> >> BPF interface to VMA in the future? Just curious.
> >
> > Our use cases don’t currently require the VMA.
> > We can add it later if a clear need arises.
>
> We should start immediately with a VMA-based approach as discussed
> previously.
Got it, I’ll update that.
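Presumably the hook will then take the VMA rather than the bare mm.
A hypothetical sketch of what the updated interface could look like:

	int (*get_suggested_order)(struct vm_area_struct *vma,
				   unsigned long tva_flags, int order);

That way a policy could also consult per-VMA state (e.g. VM_HUGEPAGE set
via madvise) while still reaching the owning mm through vma->vm_mm.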
--
Regards
Yafang
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-07-31 2:07 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
2025-07-29 9:18 [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 1/4] mm: thp: add support for " Yafang Shao
2025-07-29 15:32 ` Zi Yan
2025-07-30 2:36 ` Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 2/4] mm: thp: add a new kfunc bpf_mm_get_mem_cgroup() Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 3/4] mm: thp: add a new kfunc bpf_mm_get_task() Yafang Shao
2025-07-29 9:18 ` [RFC PATCH v4 4/4] selftest/bpf: add selftest for BPF based THP order seletection Yafang Shao
2025-07-29 15:36 ` Zi Yan
2025-07-30 2:38 ` Yafang Shao
2025-07-29 15:07 ` [RFC PATCH v4 0/4] mm, bpf: BPF based THP order selection Zi Yan
2025-07-30 2:31 ` Yafang Shao
2025-07-30 9:58 ` David Hildenbrand
2025-07-31 2:07 ` Yafang Shao