bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
@ 2025-06-08  7:35 Yafang Shao
  2025-06-08  7:35 ` [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma() Yafang Shao
                   ` (6 more replies)
  0 siblings, 7 replies; 33+ messages in thread
From: Yafang Shao @ 2025-06-08  7:35 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm, Yafang Shao

Background
----------

We have consistently configured THP to "never" on our production servers
due to past incidents caused by its behavior:

- Increased memory consumption
  THP significantly raises overall memory usage.

- Latency spikes
  Random latency spikes occur due to more frequent memory compaction
  activity triggered by THP.

- Lack of Fine-Grained Control
  THP tuning knobs are globally configured, making them unsuitable for
  containerized environments. When different workloads run on the same
  host, enabling THP globally (without per-workload control) can cause
  unpredictable behavior.

Due to these issues, system administrators remain hesitant to switch to
"madvise" or "always" modes—unless finer-grained control over THP
behavior is implemented.

New Motivation
--------------

We have now identified that certain AI workloads achieve substantial
performance gains with THP enabled. However, we’ve also verified that some
workloads see little to no benefit—or are even negatively impacted—by THP.

In our Kubernetes environment, we deploy mixed workloads on a single server
to maximize resource utilization. Our goal is to selectively enable THP for
services that benefit from it while keeping it disabled for others. This
approach allows us to incrementally enable THP for additional services and
assess how to make it more viable in production.

Proposed Solution
-----------------

To enable fine-grained control over THP behavior, we propose dynamically
adjusting THP policies using BPF. This approach allows per-workload THP
tuning, providing greater flexibility and precision.

The BPF-based THP adjustment mechanism introduces two new APIs for granular
policy control:

- THP allocator

  int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);

  The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
  indicating whether THP allocation should be performed synchronously
  (current task) or asynchronously (khugepaged).

  The decision is based on the current task context, VMA flags, and TVA
  flags.

- THP reclaimer

  int (*reclaimer)(bool vma_madvised);

  The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
  determining whether memory reclamation is handled by the current task or
  kswapd.

We may explore implementing fine-grained tuning for khugepaged in future
iterations.

Alternative Proposals
---------------------

- Gutierrez’s cgroup-based approach [1]
  - Proposed adding a new cgroup file to control THP policy.
  - However, as Johannes noted, cgroups are designed for hierarchical
    resource allocation, not arbitrary policy settings [2].

- Usama’s per-task THP proposal based on prctl() [3]:
  - Enabling THP per task via prctl().
  - This provides an alternative approach for per-workload THP tuning,
    though it lacks dynamic policy adjustment capabilities and thus offers
    limited flexibility.

This is currently a PoC implementation with limited test. Feedback of any
kind is welcome.

Link: https://lore.kernel.org/linux-mm/20241030083311.965933-1-gutierrez.asier@huawei-partners.com/ [1]
Link: https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/ [2]
Link: https://lore.kernel.org/linux-mm/20250519223307.3601786-1-usamaarif642@gmail.com/ [3]

RFC v2->v3:
Thanks to the valuable input from David and Lorenzo:
- Finer-graind tuning based on madvise or always mode
- Use BPF to write more advanced policies / allocation logic

RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows,
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (5):
  mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma()
  mm, thp: add bpf thp hook to determine thp allocator
  mm, thp: add bpf thp hook to determine thp reclaimer
  mm: thp: add bpf thp struct ops
  selftests/bpf: Add selftest for THP adjustment

 include/linux/huge_mm.h                       |   8 +
 mm/Makefile                                   |   3 +
 mm/bpf_thp.c                                  | 184 ++++++++++++++++++
 mm/huge_memory.c                              |   5 +
 mm/khugepaged.c                               |   6 +-
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 158 +++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  38 ++++
 8 files changed, 401 insertions(+), 2 deletions(-)
 create mode 100644 mm/bpf_thp.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

-- 
2.43.5


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma()
  2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
@ 2025-06-08  7:35 ` Yafang Shao
  2025-07-17 14:48   ` Usama Arif
  2025-06-08  7:35 ` [RFC PATCH v3 2/5] mm, thp: add bpf thp hook to determine thp allocator Yafang Shao
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-06-08  7:35 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm, Yafang Shao

The order has already been validated in hugepage_pmd_enabled(), so there's
no need to recheck it in thp_vma_allowable_orders().

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/khugepaged.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 15203ea7d007..79e208999ddb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -474,8 +474,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
 {
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_pmd_enabled()) {
-		if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
-					    PMD_ORDER))
+		if (__thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
+					       BIT(PMD_ORDER)))
 			__khugepaged_enter(vma->vm_mm);
 	}
 }
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH v3 2/5] mm, thp: add bpf thp hook to determine thp allocator
  2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
  2025-06-08  7:35 ` [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma() Yafang Shao
@ 2025-06-08  7:35 ` Yafang Shao
  2025-07-17 15:30   ` Usama Arif
  2025-06-08  7:35 ` [RFC PATCH v3 3/5] mm, thp: add bpf thp hook to determine thp reclaimer Yafang Shao
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-06-08  7:35 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm, Yafang Shao

A new hook bpf_thp_allocator() is added to determine if the THP is
allocated by khugepaged or by the current task.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/huge_mm.h | 10 ++++++++++
 mm/khugepaged.c         |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..db2eadd3f65b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -190,6 +190,14 @@ static inline bool hugepage_global_always(void)
 			(1<<TRANSPARENT_HUGEPAGE_FLAG);
 }
 
+#define THP_ALLOC_KHUGEPAGED (1 << 1)
+#define THP_ALLOC_CURRENT (1 << 2)
+static inline int bpf_thp_allocator(unsigned long vm_flags,
+				     unsigned long tva_flags)
+{
+	return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
+}
+
 static inline int highest_order(unsigned long orders)
 {
 	return fls_long(orders) - 1;
@@ -290,6 +298,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
 		unsigned long mask = READ_ONCE(huge_anon_orders_always);
 
+		if (!(bpf_thp_allocator(vm_flags, tva_flags) & THP_ALLOC_CURRENT))
+			return 0;
 		if (vm_flags & VM_HUGEPAGE)
 			mask |= READ_ONCE(huge_anon_orders_madvise);
 		if (hugepage_global_always() ||
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 79e208999ddb..18f800fe7335 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -472,6 +472,8 @@ void __khugepaged_enter(struct mm_struct *mm)
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  unsigned long vm_flags)
 {
+	if (!(bpf_thp_allocator(vm_flags, 0) & THP_ALLOC_KHUGEPAGED))
+		return;
 	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
 	    hugepage_pmd_enabled()) {
 		if (__thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH v3 3/5] mm, thp: add bpf thp hook to determine thp reclaimer
  2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
  2025-06-08  7:35 ` [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma() Yafang Shao
  2025-06-08  7:35 ` [RFC PATCH v3 2/5] mm, thp: add bpf thp hook to determine thp allocator Yafang Shao
@ 2025-06-08  7:35 ` Yafang Shao
  2025-07-17 16:06   ` Usama Arif
  2025-06-08  7:35 ` [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops Yafang Shao
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-06-08  7:35 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm, Yafang Shao

A new hook, bpf_thp_gfp_mask(), is introduced to determine whether memory
reclamation is being performed by the current task or by kswapd.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/huge_mm.h | 5 +++++
 mm/huge_memory.c        | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index db2eadd3f65b..6a40ebf25f5c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -198,6 +198,11 @@ static inline int bpf_thp_allocator(unsigned long vm_flags,
 	return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
 }
 
+static inline gfp_t bpf_thp_gfp_mask(bool vma_madvised)
+{
+	return 0;
+}
+
 static inline int highest_order(unsigned long orders)
 {
 	return fls_long(orders) - 1;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..81c1711d13fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1280,6 +1280,11 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = vma && (vma->vm_flags & VM_HUGEPAGE);
+	gfp_t gfp_mask;
+
+	gfp_mask = bpf_thp_gfp_mask(vma_madvised);
+	if (gfp_mask)
+		return gfp_mask;
 
 	/* Always do synchronous compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops
  2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
                   ` (2 preceding siblings ...)
  2025-06-08  7:35 ` [RFC PATCH v3 3/5] mm, thp: add bpf thp hook to determine thp reclaimer Yafang Shao
@ 2025-06-08  7:35 ` Yafang Shao
  2025-07-17 16:25   ` Usama Arif
  2025-07-17 18:21   ` Amery Hung
  2025-06-08  7:35 ` [RFC PATCH v3 5/5] selftests/bpf: Add selftest for THP adjustment Yafang Shao
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 33+ messages in thread
From: Yafang Shao @ 2025-06-08  7:35 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm, Yafang Shao

A new bpf_thp struct ops is introduced to provide finer-grained control
over THP allocation policy. The struct ops includes two APIs for
determining the THP allocator and reclaimer behavior:

- THP allocator

  int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);

  The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
  indicating whether THP allocation should be performed synchronously
  (current task) or asynchronously (khugepaged).

  The decision is based on the current task context, VMA flags, and TVA
  flags.

- THP reclaimer

  int (*reclaimer)(bool vma_madvised);

  The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
  determining whether memory reclamation is handled by the current task or
  kswapd.

  The decision depends on the current task and VMA flags.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/huge_mm.h |  13 +--
 mm/Makefile             |   3 +
 mm/bpf_thp.c            | 184 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 190 insertions(+), 10 deletions(-)
 create mode 100644 mm/bpf_thp.c

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6a40ebf25f5c..0d02c9b56a85 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -54,6 +54,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,	/* BPF prog is attached */
 };
 
 struct kobject;
@@ -192,16 +193,8 @@ static inline bool hugepage_global_always(void)
 
 #define THP_ALLOC_KHUGEPAGED (1 << 1)
 #define THP_ALLOC_CURRENT (1 << 2)
-static inline int bpf_thp_allocator(unsigned long vm_flags,
-				     unsigned long tva_flags)
-{
-	return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
-}
-
-static inline gfp_t bpf_thp_gfp_mask(bool vma_madvised)
-{
-	return 0;
-}
+int bpf_thp_allocator(unsigned long vm_flags, unsigned long tva_flags);
+gfp_t bpf_thp_gfp_mask(bool vma_madvised);
 
 static inline int highest_order(unsigned long orders)
 {
diff --git a/mm/Makefile b/mm/Makefile
index 1a7a11d4933d..e5f41cf3fd61 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,9 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+ifdef CONFIG_BPF_SYSCALL
+obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += bpf_thp.o
+endif
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
new file mode 100644
index 000000000000..894d6cb93107
--- /dev/null
+++ b/mm/bpf_thp.c
@@ -0,0 +1,184 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+#define RECLAIMER_CURRENT (1 << 1)
+#define RECLAIMER_KSWAPD (1 << 2)
+#define RECLAIMER_BOTH (RECLAIMER_CURRENT | RECLAIMER_KSWAPD)
+
+struct bpf_thp_ops {
+	/**
+	 * @allocator: Specifies whether the THP allocation is performed
+	 * by the current task or by khugepaged.
+	 * @vm_flags: Flags for the VMA in the current allocation context
+	 * @tva_flags: Flags for the TVA in the current allocation context
+	 *
+	 * Rerurn:
+	 * - THP_ALLOC_CURRENT: THP was allocated synchronously by the calling
+	 *   task's context.
+	 * - THP_ALLOC_KHUGEPAGED: THP was allocated asynchronously by the
+	 *   khugepaged kernel thread.
+	 * - 0: THP allocation is disallowed in the current context.
+	 */
+	int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
+	/**
+	 * @reclaimer: Specifies the entity performing page reclaim:
+	 *             - current task context
+	 *             - kswapd
+	 *             - none (no reclaim)
+	 * @vma_madvised: MADV flags for this VMA (e.g., MADV_HUGEPAGE, MADV_NOHUGEPAGE)
+	 *
+	 * Return:
+	 * - RECLAIMER_CURRENT: Direct reclaim by the current task if THP
+	 *   allocation fails.
+	 * - RECLAIMER_KSWAPD: Wake kswapd to reclaim memory if THP allocation fails.
+	 * - RECLAIMER_ALL: Both current and kswapd will perform the reclaim
+	 * - 0: No reclaim will be attempted.
+	 */
+	int (*reclaimer)(bool vma_madvised);
+};
+
+static struct bpf_thp_ops bpf_thp;
+
+int bpf_thp_allocator(unsigned long vm_flags, unsigned long tva_flags)
+{
+	int allocator;
+
+	/* No BPF program is attached */
+	if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
+		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
+
+	if (current_is_khugepaged())
+		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
+	if (!bpf_thp.allocator)
+		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
+
+	allocator = bpf_thp.allocator(vm_flags, tva_flags);
+	if (!allocator)
+		return 0;
+	/* invalid return value */
+	if (allocator & ~(THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT))
+		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
+	return allocator;
+}
+
+gfp_t bpf_thp_gfp_mask(bool vma_madvised)
+{
+	int reclaimer;
+
+	if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
+		return 0;
+
+	if (!bpf_thp.reclaimer)
+		return 0;
+
+	reclaimer = bpf_thp.reclaimer(vma_madvised);
+	switch (reclaimer) {
+	case RECLAIMER_CURRENT:
+		return GFP_TRANSHUGE | __GFP_NORETRY;
+	case RECLAIMER_KSWAPD:
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+	case RECLAIMER_BOTH:
+		return GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM | __GFP_NORETRY;
+	default:
+		return 0;
+	}
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+	.get_func_proto = bpf_thp_get_func_proto,
+	.is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	/* TODO: add support for multiple attaches */
+	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+		&transparent_hugepage_flags))
+		return -EOPNOTSUPP;
+	bpf_thp.allocator = ops->allocator;
+	bpf_thp.reclaimer = ops->reclaimer;
+	return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+	bpf_thp.allocator = NULL;
+	bpf_thp.reclaimer = NULL;
+}
+
+static int bpf_thp_check_member(const struct btf_type *t,
+				const struct btf_member *member,
+				const struct bpf_prog *prog)
+{
+	return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+			       const struct btf_member *member,
+			       void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_thp_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int allocator(unsigned long vm_flags, unsigned long tva_flags)
+{
+	return 0;
+}
+
+static int reclaimer(bool vma_madvised)
+{
+	return 0;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+	.allocator = allocator,
+	.reclaimer = reclaimer,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+	.verifier_ops = &thp_bpf_verifier_ops,
+	.init = bpf_thp_init,
+	.check_member = bpf_thp_check_member,
+	.init_member = bpf_thp_init_member,
+	.reg = bpf_thp_reg,
+	.unreg = bpf_thp_unreg,
+	.name = "bpf_thp_ops",
+	.cfi_stubs = &__bpf_thp_ops,
+	.owner = THIS_MODULE,
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+
+	if (err)
+		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+	return err;
+}
+late_initcall(bpf_thp_ops_init);
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC PATCH v3 5/5] selftests/bpf: Add selftest for THP adjustment
  2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
                   ` (3 preceding siblings ...)
  2025-06-08  7:35 ` [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops Yafang Shao
@ 2025-06-08  7:35 ` Yafang Shao
  2025-07-15 22:42 ` [RFC PATCH v3 0/5] mm, bpf: BPF based " David Hildenbrand
  2025-07-17 16:35 ` Usama Arif
  6 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2025-06-08  7:35 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm, Yafang Shao

This test case uses a BPF program to enforce the following THP allocation
policy:
- Current task will wakeup khugepaged to allocate THP

The result is as follows,
  $ ./test_progs --name="thp_adjust"
  #437     thp_adjust:OK
  Summary: 1/0 PASSED, 0 SKIPPED, 0 FAILED

CONFIG_TRANSPARENT_HUGEPAGE=y is required for this test.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 tools/testing/selftests/bpf/config            |   1 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 158 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     |  38 +++++
 3 files changed, 197 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index f74e1ea0ad3b..1c3c44fd536d 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -118,3 +118,4 @@ CONFIG_XDP_SOCKETS=y
 CONFIG_XFRM_INTERFACE=y
 CONFIG_TCP_CONG_DCTCP=y
 CONFIG_TCP_CONG_BBR=y
+CONFIG_TRANSPARENT_HUGEPAGE=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..ee8a731f53d4
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "test_thp_adjust.skel.h"
+
+#define LEN (4 * 1024 * 1024) /* 4MB */
+#define THP_ENABLED_PATH "/sys/kernel/mm/transparent_hugepage/enabled"
+#define SMAPS_PATH "/proc/self/smaps"
+#define ANON_HUGE_PAGES "AnonHugePages:"
+
+static char *thp_addr;
+static char old_mode[32];
+
+int thp_mode_save(void)
+{
+	const char *start, *end;
+	char buf[128];
+	int fd, err;
+	size_t len;
+
+	fd = open(THP_ENABLED_PATH, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	err = read(fd, buf, sizeof(buf) - 1);
+	if (err == -1)
+		goto close;
+
+	start = strchr(buf, '[');
+	end = start ? strchr(start, ']') : NULL;
+	if (!start || !end || end <= start) {
+		err = -1;
+		goto close;
+	}
+
+	len = end - start - 1;
+	if (len >= sizeof(old_mode))
+		len = sizeof(old_mode) - 1;
+	strncpy(old_mode, start + 1, len);
+	old_mode[len] = '\0';
+
+close:
+	close(fd);
+	return err;
+}
+
+int thp_set(const char *desired_mode)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_PATH, O_RDWR);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, desired_mode, strlen(desired_mode));
+	close(fd);
+	return err;
+}
+
+int thp_reset(void)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_PATH, O_WRONLY);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, old_mode, strlen(old_mode));
+	close(fd);
+	return err;
+}
+
+int thp_alloc(void)
+{
+	int err, i;
+
+	thp_addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (thp_addr == MAP_FAILED)
+		return -1;
+
+	for (i = 0; i < LEN; i += 4096)
+		thp_addr[i] = 1;
+
+	err = madvise(thp_addr, LEN, MADV_HUGEPAGE);
+	if (err == -1)
+		goto unmap;
+	return 0;
+
+unmap:
+	munmap(thp_addr, LEN);
+	return -1;
+}
+
+void thp_free(void)
+{
+	if (!thp_addr)
+		return;
+	munmap(thp_addr, LEN);
+}
+
+void test_thp_adjust(void)
+{
+	struct bpf_link *fentry_link, *ops_link;
+	struct test_thp_adjust *skel;
+	int err, first_calls;
+
+	if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+		return;
+	if (!ASSERT_GE(thp_set("madvise"), 0, "THP mode set"))
+		return;
+
+	skel = test_thp_adjust__open();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		goto thp_reset;
+
+	skel->bss->target_pid = getpid();
+
+	err = test_thp_adjust__load(skel);
+	if (!ASSERT_OK(err, "load"))
+		goto destroy;
+
+	fentry_link = bpf_program__attach_trace(skel->progs.thp_run);
+	if (!ASSERT_OK_PTR(fentry_link, "attach fentry"))
+		goto destroy;
+
+	ops_link = bpf_map__attach_struct_ops(skel->maps.thp);
+	if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+		goto destroy;
+
+	if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+		goto destroy;
+
+	/* After attaching struct_ops, THP will be allocated. */
+	if (!ASSERT_GT(skel->bss->khugepaged_enter, 0, "khugepaged enter"))
+		goto thp_free;
+
+	first_calls = skel->bss->khugepaged_enter;
+
+	thp_free();
+
+	if (!ASSERT_GE(thp_set("never"), 0, "THP set"))
+		goto destroy;
+
+	if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+		goto destroy;
+
+	/* In "never" mode, THP won't be allocated even if the prog is attached. */
+	if (!ASSERT_EQ(skel->bss->khugepaged_enter, first_calls, "khugepaged enter"))
+		goto thp_free;
+
+thp_free:
+	thp_free();
+destroy:
+	test_thp_adjust__destroy(skel);
+thp_reset:
+	ASSERT_GE(thp_reset(), 0, "THP mode reset");
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..9a3d8bfcd124
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,38 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define THP_ALLOC_KHUGEPAGED (1<<1)
+
+int target_pid;
+int khugepaged_enter;
+
+SEC("fentry/__khugepaged_enter")
+int BPF_PROG(thp_run, struct mm_struct *mm)
+{
+	struct task_struct *current = bpf_get_current_task_btf();
+
+	if (current->mm == mm && current->pid == target_pid)
+		khugepaged_enter++;
+	return 0;
+}
+
+SEC("struct_ops/allocator")
+int BPF_PROG(bpf_thp_allocator)
+{
+	struct task_struct *current = bpf_get_current_task_btf();
+
+	/* Allocate THP for this task in khugepaged. */
+	if (current->pid == target_pid)
+		return THP_ALLOC_KHUGEPAGED;
+	return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops thp = {
+	.allocator = (void *)bpf_thp_allocator,
+};
-- 
2.43.5


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
                   ` (4 preceding siblings ...)
  2025-06-08  7:35 ` [RFC PATCH v3 5/5] selftests/bpf: Add selftest for THP adjustment Yafang Shao
@ 2025-07-15 22:42 ` David Hildenbrand
  2025-07-17  3:09   ` Yafang Shao
  2025-07-17 16:35 ` Usama Arif
  6 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2025-07-15 22:42 UTC (permalink / raw)
  To: Yafang Shao, akpm, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm

On 08.06.25 09:35, Yafang Shao wrote:

Sorry for not replying earlier, I was caught up with all other stuff.

I still consider this a very interesting approach, although I think we 
should think more about what a reasonable policy would look like 
medoium-term (in particular, multiple THP sizes, not always falling back 
to small pages if it means splitting excessively in the buddy etc.)

> Background
> ----------
> 
> We have consistently configured THP to "never" on our production servers
> due to past incidents caused by its behavior:
> 
> - Increased memory consumption
>    THP significantly raises overall memory usage.
> 
> - Latency spikes
>    Random latency spikes occur due to more frequent memory compaction
>    activity triggered by THP.
> 
> - Lack of Fine-Grained Control
>    THP tuning knobs are globally configured, making them unsuitable for
>    containerized environments. When different workloads run on the same
>    host, enabling THP globally (without per-workload control) can cause
>    unpredictable behavior.
> 
> Due to these issues, system administrators remain hesitant to switch to
> "madvise" or "always" modes—unless finer-grained control over THP
> behavior is implemented.
> 
> New Motivation
> --------------
> 
> We have now identified that certain AI workloads achieve substantial
> performance gains with THP enabled. However, we’ve also verified that some
> workloads see little to no benefit—or are even negatively impacted—by THP.
> 
> In our Kubernetes environment, we deploy mixed workloads on a single server
> to maximize resource utilization. Our goal is to selectively enable THP for
> services that benefit from it while keeping it disabled for others. This
> approach allows us to incrementally enable THP for additional services and
> assess how to make it more viable in production.
> 
> Proposed Solution
> -----------------
> 
> To enable fine-grained control over THP behavior, we propose dynamically
> adjusting THP policies using BPF. This approach allows per-workload THP
> tuning, providing greater flexibility and precision.
> 
> The BPF-based THP adjustment mechanism introduces two new APIs for granular
> policy control:
> 
> - THP allocator
> 
>    int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> 
>    The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
>    indicating whether THP allocation should be performed synchronously
>    (current task) or asynchronously (khugepaged).
> 
>    The decision is based on the current task context, VMA flags, and TVA
>    flags.

I think we should go one step further and actually get advises about the 
orders (THP sizes) to use. It might be helpful if the program would have 
access to system stats, to make an educated decision.

Given page fault information and system information, the program could 
then decide which orders to try to allocate.

That means, one would query during page faults and during khugepaged, 
which order one should try -- compared to our current approach of "start 
with the largest order that is enabled and fits".

> 
> - THP reclaimer
> 
>    int (*reclaimer)(bool vma_madvised);
> 
>    The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
>    determining whether memory reclamation is handled by the current task or
>    kswapd.

Not sure about that, will have to look into the details.

But what could be interesting is deciding how to deal with underutilized 
THPs: for now we will try replacing zero-filled pages by the shared 
zeropage during a split. *maybe* some workloads could benefit from ... 
not doing that, and instead optimize the split.

Will maybe be a bit more trick, though.


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-15 22:42 ` [RFC PATCH v3 0/5] mm, bpf: BPF based " David Hildenbrand
@ 2025-07-17  3:09   ` Yafang Shao
  2025-07-17  8:52     ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-07-17  3:09 UTC (permalink / raw)
  To: David Hildenbrand, Matthew Wilcox
  Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	ast, daniel, andrii, bpf, linux-mm

On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 08.06.25 09:35, Yafang Shao wrote:
>
> Sorry for not replying earlier, I was caught up with all other stuff.
>
> I still consider this a very interesting approach, although I think we
> should think more about what a reasonable policy would look like
> medoium-term (in particular, multiple THP sizes, not always falling back
> to small pages if it means splitting excessively in the buddy etc.)

I find it difficult to understand why we introduced the mTHP sysfs
knobs instead of implementing automatic THP size switching within the
kernel. I'm skeptical about its practical utility in real-world
workloads.

In contrast, XFS large folio (AKA. File THP) can automatically select
orders between 0 and 9. Based on our verification, this feature has
proven genuinely useful for certain specific workloads—though it's not
yet perfect.

>
> > Background
> > ----------
> >
> > We have consistently configured THP to "never" on our production servers
> > due to past incidents caused by its behavior:
> >
> > - Increased memory consumption
> >    THP significantly raises overall memory usage.
> >
> > - Latency spikes
> >    Random latency spikes occur due to more frequent memory compaction
> >    activity triggered by THP.
> >
> > - Lack of Fine-Grained Control
> >    THP tuning knobs are globally configured, making them unsuitable for
> >    containerized environments. When different workloads run on the same
> >    host, enabling THP globally (without per-workload control) can cause
> >    unpredictable behavior.
> >
> > Due to these issues, system administrators remain hesitant to switch to
> > "madvise" or "always" modes—unless finer-grained control over THP
> > behavior is implemented.
> >
> > New Motivation
> > --------------
> >
> > We have now identified that certain AI workloads achieve substantial
> > performance gains with THP enabled. However, we’ve also verified that some
> > workloads see little to no benefit—or are even negatively impacted—by THP.
> >
> > In our Kubernetes environment, we deploy mixed workloads on a single server
> > to maximize resource utilization. Our goal is to selectively enable THP for
> > services that benefit from it while keeping it disabled for others. This
> > approach allows us to incrementally enable THP for additional services and
> > assess how to make it more viable in production.
> >
> > Proposed Solution
> > -----------------
> >
> > To enable fine-grained control over THP behavior, we propose dynamically
> > adjusting THP policies using BPF. This approach allows per-workload THP
> > tuning, providing greater flexibility and precision.
> >
> > The BPF-based THP adjustment mechanism introduces two new APIs for granular
> > policy control:
> >
> > - THP allocator
> >
> >    int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> >
> >    The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> >    indicating whether THP allocation should be performed synchronously
> >    (current task) or asynchronously (khugepaged).
> >
> >    The decision is based on the current task context, VMA flags, and TVA
> >    flags.
>
> I think we should go one step further and actually get advises about the
> orders (THP sizes) to use. It might be helpful if the program would have
> access to system stats, to make an educated decision.
>
> Given page fault information and system information, the program could
> then decide which orders to try to allocate.

Yes, that aligns with my thoughts as well. For instance, we could
automate the decision-making process based on factors like PSI, memory
fragmentation, and other metrics. However, this logic could be
implemented within BPF programs—all we’d need is to extend the feature
by introducing a few kfuncs (also known as BPF helpers).

>
> That means, one would query during page faults and during khugepaged,
> which order one should try -- compared to our current approach of "start
> with the largest order that is enabled and fits".
>
> >
> > - THP reclaimer
> >
> >    int (*reclaimer)(bool vma_madvised);
> >
> >    The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> >    determining whether memory reclamation is handled by the current task or
> >    kswapd.
>
> Not sure about that, will have to look into the details.

Some workloads allocate all their memory during initialization and do
not require THP at runtime. For such cases, aggressively attempting
THP allocation is beneficial. However, other workloads may dynamically
allocate THP during execution—if these are latency-sensitive, we must
avoid introducing long allocation delays.

Given these differing requirements, the global
/sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
Instead, we should implement per-workload defrag policies to better
optimize performance based on individual application behavior.

>
> But what could be interesting is deciding how to deal with underutilized
> THPs: for now we will try replacing zero-filled pages by the shared
> zeropage during a split. *maybe* some workloads could benefit from ...
> not doing that, and instead optimize the split.

I believe a per-workload THP shrinker (e.g.,
/sys/kernel/mm/transparent_hugepage/shrink_underused) would also be
valuable.
Thank you for the suggestion.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-17  3:09   ` Yafang Shao
@ 2025-07-17  8:52     ` David Hildenbrand
  2025-07-17  9:05       ` Lorenzo Stoakes
  2025-07-20  2:32       ` Yafang Shao
  0 siblings, 2 replies; 33+ messages in thread
From: David Hildenbrand @ 2025-07-17  8:52 UTC (permalink / raw)
  To: Yafang Shao, Matthew Wilcox
  Cc: akpm, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	ast, daniel, andrii, bpf, linux-mm

On 17.07.25 05:09, Yafang Shao wrote:
> On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 08.06.25 09:35, Yafang Shao wrote:
>>
>> Sorry for not replying earlier, I was caught up with all other stuff.
>>
>> I still consider this a very interesting approach, although I think we
>> should think more about what a reasonable policy would look like
>> medoium-term (in particular, multiple THP sizes, not always falling back
>> to small pages if it means splitting excessively in the buddy etc.)
> 
> I find it difficult to understand why we introduced the mTHP sysfs
> knobs instead of implementing automatic THP size switching within the
> kernel. I'm skeptical about its practical utility in real-world
> workloads.
> 
> In contrast, XFS large folio (AKA. File THP) can automatically select
> orders between 0 and 9. Based on our verification, this feature has
> proven genuinely useful for certain specific workloads—though it's not
> yet perfect.

I suggest you do some digging about the history of these toggles and the 
plans for the future (automatic), there has been plenty of talk about 
all that.

[...]

>>>
>>> - THP allocator
>>>
>>>     int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
>>>
>>>     The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
>>>     indicating whether THP allocation should be performed synchronously
>>>     (current task) or asynchronously (khugepaged).
>>>
>>>     The decision is based on the current task context, VMA flags, and TVA
>>>     flags.
>>
>> I think we should go one step further and actually get advises about the
>> orders (THP sizes) to use. It might be helpful if the program would have
>> access to system stats, to make an educated decision.
>>
>> Given page fault information and system information, the program could
>> then decide which orders to try to allocate.
> 
> Yes, that aligns with my thoughts as well. For instance, we could
> automate the decision-making process based on factors like PSI, memory
> fragmentation, and other metrics. However, this logic could be
> implemented within BPF programs—all we’d need is to extend the feature
> by introducing a few kfuncs (also known as BPF helpers).

We discussed this yesterday at a THP upstream meeting, and what we 
should look into is:

(1) Having a callback like

unsigned int (*get_suggested_order)(.., bool in_pagefault);

Where we can provide some information about the fault (vma 
size/flags/anon_name), and whether we are in the page fault (or in 
khugepaged).

Maybe we want a bitmap of orders to try (fallback), not sure yet.

(2) Having some way to tag these callbacks as "this is absolutely 
unstable for now and can be changed as we please.".

One idea will be to use this mechanism as a way to easily prototype 
policies, and once we know that a policy works, start moving it into the 
core.

In general, the core, without a BPF program, should be able to continue 
providing a sane default behavior.

> 
>>
>> That means, one would query during page faults and during khugepaged,
>> which order one should try -- compared to our current approach of "start
>> with the largest order that is enabled and fits".
>>
>>>
>>> - THP reclaimer
>>>
>>>     int (*reclaimer)(bool vma_madvised);
>>>
>>>     The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
>>>     determining whether memory reclamation is handled by the current task or
>>>     kswapd.
>>
>> Not sure about that, will have to look into the details.
> 
> Some workloads allocate all their memory during initialization and do
> not require THP at runtime. For such cases, aggressively attempting
> THP allocation is beneficial. However, other workloads may dynamically
> allocate THP during execution—if these are latency-sensitive, we must
> avoid introducing long allocation delays.
> 
> Given these differing requirements, the global
> /sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
> Instead, we should implement per-workload defrag policies to better
> optimize performance based on individual application behavior.

We'll be very careful about the callbacks we will offer. Maybe the 
get_suggested_order() callback could itself make a decision and not 
suggest a high order if allocation would require comapction.

Initially, we should keep it simple and see what other callbacks to add 
/ how to extend get_suggested_order(), to cover these cases.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-17  8:52     ` David Hildenbrand
@ 2025-07-17  9:05       ` Lorenzo Stoakes
  2025-07-20  2:32       ` Yafang Shao
  1 sibling, 0 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17  9:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yafang Shao, Matthew Wilcox, akpm, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Thu, Jul 17, 2025 at 10:52:12AM +0200, David Hildenbrand wrote:
> On 17.07.25 05:09, Yafang Shao wrote:
> > On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > - THP allocator
> > > >
> > > >     int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> > > >
> > > >     The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> > > >     indicating whether THP allocation should be performed synchronously
> > > >     (current task) or asynchronously (khugepaged).
> > > >
> > > >     The decision is based on the current task context, VMA flags, and TVA
> > > >     flags.
> > >
> > > I think we should go one step further and actually get advises about the
> > > orders (THP sizes) to use. It might be helpful if the program would have
> > > access to system stats, to make an educated decision.
> > >
> > > Given page fault information and system information, the program could
> > > then decide which orders to try to allocate.
> >
> > Yes, that aligns with my thoughts as well. For instance, we could
> > automate the decision-making process based on factors like PSI, memory
> > fragmentation, and other metrics. However, this logic could be
> > implemented within BPF programs—all we’d need is to extend the feature
> > by introducing a few kfuncs (also known as BPF helpers).
>
> We discussed this yesterday at a THP upstream meeting, and what we should
> look into is:
>
> (1) Having a callback like
>
> unsigned int (*get_suggested_order)(.., bool in_pagefault);
>
> Where we can provide some information about the fault (vma
> size/flags/anon_name), and whether we are in the page fault (or in
> khugepaged).
>
> Maybe we want a bitmap of orders to try (fallback), not sure yet.

Ah I mentioned fallback below then noticed you mentioned here :)

>
> (2) Having some way to tag these callbacks as "this is absolutely unstable
> for now and can be changed as we please.".
>
> One idea will be to use this mechanism as a way to easily prototype
> policies, and once we know that a policy works, start moving it into the
> core.
>
> In general, the core, without a BPF program, should be able to continue
> providing a sane default behavior.

I have warmed to this approach overall and I think one thing that was very
clearly positive about this that came out of the call was the idea that we
can rapidly prototype different ideas.

I think a key to all this is ensuring that we:

- Mark this interface very clearly unstable to begin with.
- Keep the interface as simple as possible.

I think perhaps the more challenging thing here will be providing the right
amount of information to the caller to make decisions.

Also precisely how we use this too - obviously we need to be _trying_ to
allocate at the requested order but should that fail allocate less but
precisely how we do the fallback is something to think about.

I think generally this is the current best way forward before an automagic
world... which is a very long-term project.

>
> >
> > >
> > > That means, one would query during page faults and during khugepaged,
> > > which order one should try -- compared to our current approach of "start
> > > with the largest order that is enabled and fits".
> > >
> > > >
> > > > - THP reclaimer
> > > >
> > > >     int (*reclaimer)(bool vma_madvised);
> > > >
> > > >     The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> > > >     determining whether memory reclamation is handled by the current task or
> > > >     kswapd.
> > >
> > > Not sure about that, will have to look into the details.
> >
> > Some workloads allocate all their memory during initialization and do
> > not require THP at runtime. For such cases, aggressively attempting
> > THP allocation is beneficial. However, other workloads may dynamically
> > allocate THP during execution—if these are latency-sensitive, we must
> > avoid introducing long allocation delays.
> >
> > Given these differing requirements, the global
> > /sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
> > Instead, we should implement per-workload defrag policies to better
> > optimize performance based on individual application behavior.
>
> We'll be very careful about the callbacks we will offer. Maybe the
> get_suggested_order() callback could itself make a decision and not suggest
> a high order if allocation would require comapction.
>
> Initially, we should keep it simple and see what other callbacks to add /
> how to extend get_suggested_order(), to cover these cases.

Yes, caution vital here.

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma()
  2025-06-08  7:35 ` [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma() Yafang Shao
@ 2025-07-17 14:48   ` Usama Arif
  2025-07-20  2:37     ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-07-17 14:48 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm



On 08/06/2025 08:35, Yafang Shao wrote:
> The order has already been validated in hugepage_pmd_enabled(), so there's
> no need to recheck it in thp_vma_allowable_orders().
> 


The checks are not equivalent.

hugepage_pmd_enabled just checks if the sysfs entries allow hugification.
thp_vma_allowable_orders modifies the orders that can be used based on vm_flags,
which is not done in hugepage_pmd_enabled.

> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/khugepaged.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 15203ea7d007..79e208999ddb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -474,8 +474,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma,
>  {
>  	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
>  	    hugepage_pmd_enabled()) {
> -		if (thp_vma_allowable_order(vma, vm_flags, TVA_ENFORCE_SYSFS,
> -					    PMD_ORDER))
> +		if (__thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,
> +					       BIT(PMD_ORDER)))
>  			__khugepaged_enter(vma->vm_mm);
>  	}
>  }



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 2/5] mm, thp: add bpf thp hook to determine thp allocator
  2025-06-08  7:35 ` [RFC PATCH v3 2/5] mm, thp: add bpf thp hook to determine thp allocator Yafang Shao
@ 2025-07-17 15:30   ` Usama Arif
  2025-07-20  3:00     ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-07-17 15:30 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm



On 08/06/2025 08:35, Yafang Shao wrote:
> A new hook bpf_thp_allocator() is added to determine if the THP is
> allocated by khugepaged or by the current task.

I would add in the summary why we need this. I am assuming I will find out
when reviewing the next few patches, but would be good to know here.

> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/huge_mm.h | 10 ++++++++++
>  mm/khugepaged.c         |  2 ++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..db2eadd3f65b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -190,6 +190,14 @@ static inline bool hugepage_global_always(void)
>  			(1<<TRANSPARENT_HUGEPAGE_FLAG);
>  }
>  
> +#define THP_ALLOC_KHUGEPAGED (1 << 1)
> +#define THP_ALLOC_CURRENT (1 << 2)
> +static inline int bpf_thp_allocator(unsigned long vm_flags,
> +				     unsigned long tva_flags)
> +{
> +	return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;

You dont use either vm_flags or tva_flags in this function?
I am guessing you wanted to check if these bits are set here?


But you dont seem to be setting these flags anywhere? I am guessing
its in a future patch. If it is, I would move the setting of these bits
here as its confusing to only see the check without knowing where its  

I feel this patch is broken and needs to be rewritten.
> +}
> +
>  static inline int highest_order(unsigned long orders)
>  {
>  	return fls_long(orders) - 1;
> @@ -290,6 +298,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  	if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
>  		unsigned long mask = READ_ONCE(huge_anon_orders_always);
>  
> +		if (!(bpf_thp_allocator(vm_flags, tva_flags) & THP_ALLOC_CURRENT))
> +			return 0;

I am assuming that this is the point to check for allocation, but thp_vma_allowable_orders
is not just used for allocation, its used for in other places as well, like hugepage_vma_revalidate
and swap.

>  		if (vm_flags & VM_HUGEPAGE)
>  			mask |= READ_ONCE(huge_anon_orders_madvise);
>  		if (hugepage_global_always() ||
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 79e208999ddb..18f800fe7335 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -472,6 +472,8 @@ void __khugepaged_enter(struct mm_struct *mm)
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  unsigned long vm_flags)
>  {
> +	if (!(bpf_thp_allocator(vm_flags, 0) & THP_ALLOC_KHUGEPAGED))
> +		return;
>  	if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) &&
>  	    hugepage_pmd_enabled()) {
>  		if (__thp_vma_allowable_orders(vma, vm_flags, TVA_ENFORCE_SYSFS,


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 3/5] mm, thp: add bpf thp hook to determine thp reclaimer
  2025-06-08  7:35 ` [RFC PATCH v3 3/5] mm, thp: add bpf thp hook to determine thp reclaimer Yafang Shao
@ 2025-07-17 16:06   ` Usama Arif
  2025-07-20  3:03     ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-07-17 16:06 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm



On 08/06/2025 08:35, Yafang Shao wrote:
> A new hook, bpf_thp_gfp_mask(), is introduced to determine whether memory
> reclamation is being performed by the current task or by kswapd.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/huge_mm.h | 5 +++++
>  mm/huge_memory.c        | 5 +++++
>  2 files changed, 10 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index db2eadd3f65b..6a40ebf25f5c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -198,6 +198,11 @@ static inline int bpf_thp_allocator(unsigned long vm_flags,
>  	return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
>  }
>  
> +static inline gfp_t bpf_thp_gfp_mask(bool vma_madvised)
> +{
> +	return 0;
> +}
> +
>  static inline int highest_order(unsigned long orders)
>  {
>  	return fls_long(orders) - 1;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d3e66136e41a..81c1711d13fa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1280,6 +1280,11 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma)
>  {
>  	const bool vma_madvised = vma && (vma->vm_flags & VM_HUGEPAGE);
> +	gfp_t gfp_mask;
> +
> +	gfp_mask = bpf_thp_gfp_mask(vma_madvised);


I am guessing bpf_thp_gfp_mask returns 0, as its something yet to be implemented, 
but I really dont understand what this patch is supposed to do.


> +	if (gfp_mask)
> +		return gfp_mask;
>  
>  	/* Always do synchronous compaction */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops
  2025-06-08  7:35 ` [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops Yafang Shao
@ 2025-07-17 16:25   ` Usama Arif
  2025-07-17 18:21   ` Amery Hung
  1 sibling, 0 replies; 33+ messages in thread
From: Usama Arif @ 2025-07-17 16:25 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm



On 08/06/2025 08:35, Yafang Shao wrote:
> A new bpf_thp struct ops is introduced to provide finer-grained control
> over THP allocation policy. The struct ops includes two APIs for
> determining the THP allocator and reclaimer behavior:
> 
> - THP allocator
> 
>   int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> 
>   The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
>   indicating whether THP allocation should be performed synchronously
>   (current task) or asynchronously (khugepaged).
> 
>   The decision is based on the current task context, VMA flags, and TVA
>   flags.
> 
> - THP reclaimer
> 
>   int (*reclaimer)(bool vma_madvised);
> 
>   The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
>   determining whether memory reclamation is handled by the current task or
>   kswapd.
> 
>   The decision depends on the current task and VMA flags.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/huge_mm.h |  13 +--
>  mm/Makefile             |   3 +
>  mm/bpf_thp.c            | 184 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 190 insertions(+), 10 deletions(-)
>  create mode 100644 mm/bpf_thp.c
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 6a40ebf25f5c..0d02c9b56a85 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -54,6 +54,7 @@ enum transparent_hugepage_flag {
>  	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>  	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,	/* BPF prog is attached */
>  };
>  
>  struct kobject;
> @@ -192,16 +193,8 @@ static inline bool hugepage_global_always(void)
>  
>  #define THP_ALLOC_KHUGEPAGED (1 << 1)
>  #define THP_ALLOC_CURRENT (1 << 2)
> -static inline int bpf_thp_allocator(unsigned long vm_flags,
> -				     unsigned long tva_flags)
> -{
> -	return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> -}
> -
> -static inline gfp_t bpf_thp_gfp_mask(bool vma_madvised)
> -{
> -	return 0;
> -}

It makes it quite confusing for review to add code in earlier patches and remove it here.

I dont think you should have had the first 3 patches? and the code is mostly in
this patch?

> +int bpf_thp_allocator(unsigned long vm_flags, unsigned long tva_flags);
> +gfp_t bpf_thp_gfp_mask(bool vma_madvised);
>  
>  static inline int highest_order(unsigned long orders)
>  {
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..e5f41cf3fd61 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,9 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +ifdef CONFIG_BPF_SYSCALL
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += bpf_thp.o
> +endif
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> new file mode 100644
> index 000000000000..894d6cb93107
> --- /dev/null
> +++ b/mm/bpf_thp.c
> @@ -0,0 +1,184 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +#define RECLAIMER_CURRENT (1 << 1)
> +#define RECLAIMER_KSWAPD (1 << 2)
> +#define RECLAIMER_BOTH (RECLAIMER_CURRENT | RECLAIMER_KSWAPD)
> +
> +struct bpf_thp_ops {
> +	/**
> +	 * @allocator: Specifies whether the THP allocation is performed
> +	 * by the current task or by khugepaged.
> +	 * @vm_flags: Flags for the VMA in the current allocation context
> +	 * @tva_flags: Flags for the TVA in the current allocation context
> +	 *
> +	 * Rerurn:
> +	 * - THP_ALLOC_CURRENT: THP was allocated synchronously by the calling
> +	 *   task's context.
> +	 * - THP_ALLOC_KHUGEPAGED: THP was allocated asynchronously by the
> +	 *   khugepaged kernel thread.
> +	 * - 0: THP allocation is disallowed in the current context.
> +	 */
> +	int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> +	/**
> +	 * @reclaimer: Specifies the entity performing page reclaim:
> +	 *             - current task context
> +	 *             - kswapd
> +	 *             - none (no reclaim)
> +	 * @vma_madvised: MADV flags for this VMA (e.g., MADV_HUGEPAGE, MADV_NOHUGEPAGE)
> +	 *
> +	 * Return:
> +	 * - RECLAIMER_CURRENT: Direct reclaim by the current task if THP
> +	 *   allocation fails.
> +	 * - RECLAIMER_KSWAPD: Wake kswapd to reclaim memory if THP allocation fails.
> +	 * - RECLAIMER_ALL: Both current and kswapd will perform the reclaim
> +	 * - 0: No reclaim will be attempted.
> +	 */
> +	int (*reclaimer)(bool vma_madvised);
> +};
> +
> +static struct bpf_thp_ops bpf_thp;
> +
> +int bpf_thp_allocator(unsigned long vm_flags, unsigned long tva_flags)
> +{
> +	int allocator;
> +
> +	/* No BPF program is attached */
> +	if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
> +		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> +
> +	if (current_is_khugepaged())
> +		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> +	if (!bpf_thp.allocator)
> +		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;

Probably make it 

if (current_is_khugepaged() || !bpf_thp.allocator)
	return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;

> +
> +	allocator = bpf_thp.allocator(vm_flags, tva_flags);
> +	if (!allocator)
> +		return 0;
> +	/* invalid return value */
> +	if (allocator & ~(THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT))
> +		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> +	return allocator;
> +}
> +
> +gfp_t bpf_thp_gfp_mask(bool vma_madvised)
> +{
> +	int reclaimer;
> +
> +	if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
> +		return 0;
> +
> +	if (!bpf_thp.reclaimer)
> +		return 0;
> +
> +	reclaimer = bpf_thp.reclaimer(vma_madvised);
> +	switch (reclaimer) {
> +	case RECLAIMER_CURRENT:
> +		return GFP_TRANSHUGE | __GFP_NORETRY;
> +	case RECLAIMER_KSWAPD:
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +	case RECLAIMER_BOTH:
> +		return GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM | __GFP_NORETRY;
> +	default:
> +		return 0;

maybe you let the userspace decide GFP flags instead of having RECLAIMER_xyz?

> +	}
> +}
> +
> +static bool bpf_thp_ops_is_valid_access(int off, int size,
> +					enum bpf_access_type type,
> +					const struct bpf_prog *prog,
> +					struct bpf_insn_access_aux *info)
> +{
> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return bpf_base_func_proto(func_id, prog);
> +}
> +
> +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> +	.get_func_proto = bpf_thp_get_func_proto,
> +	.is_valid_access = bpf_thp_ops_is_valid_access,
> +};
> +
> +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +
> +	/* TODO: add support for multiple attaches */
> +	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +		&transparent_hugepage_flags))
> +		return -EOPNOTSUPP;
> +	bpf_thp.allocator = ops->allocator;
> +	bpf_thp.reclaimer = ops->reclaimer;
> +	return 0;
> +}
> +
> +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> +{
> +	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> +	bpf_thp.allocator = NULL;
> +	bpf_thp.reclaimer = NULL;
> +}
> +
> +static int bpf_thp_check_member(const struct btf_type *t,
> +				const struct btf_member *member,
> +				const struct bpf_prog *prog)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_init_member(const struct btf_type *t,
> +			       const struct btf_member *member,
> +			       void *kdata, const void *udata)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_init(struct btf *btf)
> +{
> +	return 0;
> +}
> +
> +static int allocator(unsigned long vm_flags, unsigned long tva_flags)
> +{
> +	return 0;
> +}
> +
> +static int reclaimer(bool vma_madvised)
> +{
> +	return 0;
> +}
> +
> +static struct bpf_thp_ops __bpf_thp_ops = {
> +	.allocator = allocator,
> +	.reclaimer = reclaimer,
> +};
> +
> +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> +	.verifier_ops = &thp_bpf_verifier_ops,
> +	.init = bpf_thp_init,
> +	.check_member = bpf_thp_check_member,
> +	.init_member = bpf_thp_init_member,
> +	.reg = bpf_thp_reg,
> +	.unreg = bpf_thp_unreg,
> +	.name = "bpf_thp_ops",
> +	.cfi_stubs = &__bpf_thp_ops,
> +	.owner = THIS_MODULE,
> +};
> +
> +static int __init bpf_thp_ops_init(void)
> +{
> +	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> +
> +	if (err)
> +		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> +	return err;
> +}
> +late_initcall(bpf_thp_ops_init);


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
                   ` (5 preceding siblings ...)
  2025-07-15 22:42 ` [RFC PATCH v3 0/5] mm, bpf: BPF based " David Hildenbrand
@ 2025-07-17 16:35 ` Usama Arif
  2025-07-20  2:54   ` Yafang Shao
  6 siblings, 1 reply; 33+ messages in thread
From: Usama Arif @ 2025-07-17 16:35 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm



On 08/06/2025 08:35, Yafang Shao wrote:
> Background
> ----------
> 
> We have consistently configured THP to "never" on our production servers
> due to past incidents caused by its behavior:
> 
> - Increased memory consumption
>   THP significantly raises overall memory usage.
> 
> - Latency spikes
>   Random latency spikes occur due to more frequent memory compaction
>   activity triggered by THP.
> 
> - Lack of Fine-Grained Control
>   THP tuning knobs are globally configured, making them unsuitable for
>   containerized environments. When different workloads run on the same
>   host, enabling THP globally (without per-workload control) can cause
>   unpredictable behavior.
> 
> Due to these issues, system administrators remain hesitant to switch to
> "madvise" or "always" modes—unless finer-grained control over THP
> behavior is implemented.
> 

Would this solution work and be carried over in fork+exec? As that is something
which is very common. How would that look like?

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops
  2025-06-08  7:35 ` [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops Yafang Shao
  2025-07-17 16:25   ` Usama Arif
@ 2025-07-17 18:21   ` Amery Hung
  2025-07-20  3:07     ` Yafang Shao
  1 sibling, 1 reply; 33+ messages in thread
From: Amery Hung @ 2025-07-17 18:21 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, willy, ast, daniel, andrii
  Cc: bpf, linux-mm



On 6/8/25 12:35 AM, Yafang Shao wrote:
> A new bpf_thp struct ops is introduced to provide finer-grained control
> over THP allocation policy. The struct ops includes two APIs for
> determining the THP allocator and reclaimer behavior:
>
> - THP allocator
>
>    int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
>
>    The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
>    indicating whether THP allocation should be performed synchronously
>    (current task) or asynchronously (khugepaged).
>
>    The decision is based on the current task context, VMA flags, and TVA
>    flags.
>
> - THP reclaimer
>
>    int (*reclaimer)(bool vma_madvised);
>
>    The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
>    determining whether memory reclamation is handled by the current task or
>    kswapd.
>
>    The decision depends on the current task and VMA flags.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>   include/linux/huge_mm.h |  13 +--
>   mm/Makefile             |   3 +
>   mm/bpf_thp.c            | 184 ++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 190 insertions(+), 10 deletions(-)
>   create mode 100644 mm/bpf_thp.c
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 6a40ebf25f5c..0d02c9b56a85 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -54,6 +54,7 @@ enum transparent_hugepage_flag {
>   	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>   	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>   	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,	/* BPF prog is attached */
>   };
>   
>   struct kobject;
> @@ -192,16 +193,8 @@ static inline bool hugepage_global_always(void)
>   
>   #define THP_ALLOC_KHUGEPAGED (1 << 1)
>   #define THP_ALLOC_CURRENT (1 << 2)
> -static inline int bpf_thp_allocator(unsigned long vm_flags,
> -				     unsigned long tva_flags)
> -{
> -	return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> -}
> -
> -static inline gfp_t bpf_thp_gfp_mask(bool vma_madvised)
> -{
> -	return 0;
> -}
> +int bpf_thp_allocator(unsigned long vm_flags, unsigned long tva_flags);
> +gfp_t bpf_thp_gfp_mask(bool vma_madvised);
>   
>   static inline int highest_order(unsigned long orders)
>   {
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..e5f41cf3fd61 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,9 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>   obj-$(CONFIG_NUMA) += memory-tiers.o
>   obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +ifdef CONFIG_BPF_SYSCALL
> +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += bpf_thp.o
> +endif
>   obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>   obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>   obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> new file mode 100644
> index 000000000000..894d6cb93107
> --- /dev/null
> +++ b/mm/bpf_thp.c
> @@ -0,0 +1,184 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +#define RECLAIMER_CURRENT (1 << 1)
> +#define RECLAIMER_KSWAPD (1 << 2)
> +#define RECLAIMER_BOTH (RECLAIMER_CURRENT | RECLAIMER_KSWAPD)
> +
> +struct bpf_thp_ops {
> +	/**
> +	 * @allocator: Specifies whether the THP allocation is performed
> +	 * by the current task or by khugepaged.
> +	 * @vm_flags: Flags for the VMA in the current allocation context
> +	 * @tva_flags: Flags for the TVA in the current allocation context
> +	 *
> +	 * Rerurn:
> +	 * - THP_ALLOC_CURRENT: THP was allocated synchronously by the calling
> +	 *   task's context.
> +	 * - THP_ALLOC_KHUGEPAGED: THP was allocated asynchronously by the
> +	 *   khugepaged kernel thread.
> +	 * - 0: THP allocation is disallowed in the current context.
> +	 */
> +	int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> +	/**
> +	 * @reclaimer: Specifies the entity performing page reclaim:
> +	 *             - current task context
> +	 *             - kswapd
> +	 *             - none (no reclaim)
> +	 * @vma_madvised: MADV flags for this VMA (e.g., MADV_HUGEPAGE, MADV_NOHUGEPAGE)
> +	 *
> +	 * Return:
> +	 * - RECLAIMER_CURRENT: Direct reclaim by the current task if THP
> +	 *   allocation fails.
> +	 * - RECLAIMER_KSWAPD: Wake kswapd to reclaim memory if THP allocation fails.
> +	 * - RECLAIMER_ALL: Both current and kswapd will perform the reclaim
> +	 * - 0: No reclaim will be attempted.
> +	 */
> +	int (*reclaimer)(bool vma_madvised);
> +};
> +
> +static struct bpf_thp_ops bpf_thp;
> +
> +int bpf_thp_allocator(unsigned long vm_flags, unsigned long tva_flags)
> +{
> +	int allocator;
> +
> +	/* No BPF program is attached */
> +	if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
> +		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> +
> +	if (current_is_khugepaged())
> +		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> +	if (!bpf_thp.allocator)
> +		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> +
> +	allocator = bpf_thp.allocator(vm_flags, tva_flags);
> +	if (!allocator)
> +		return 0;

The check seems redundant. Is it?

> +	/* invalid return value */
> +	if (allocator & ~(THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT))
> +		return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> +	return allocator;
> +}
> +
> +gfp_t bpf_thp_gfp_mask(bool vma_madvised)
> +{
> +	int reclaimer;
> +
> +	if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
> +		return 0;
> +
> +	if (!bpf_thp.reclaimer)
> +		return 0;
> +
> +	reclaimer = bpf_thp.reclaimer(vma_madvised);
> +	switch (reclaimer) {
> +	case RECLAIMER_CURRENT:
> +		return GFP_TRANSHUGE | __GFP_NORETRY;
> +	case RECLAIMER_KSWAPD:
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +	case RECLAIMER_BOTH:
> +		return GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM | __GFP_NORETRY;
> +	default:
> +		return 0;
> +	}
> +}
> +
> +static bool bpf_thp_ops_is_valid_access(int off, int size,
> +					enum bpf_access_type type,
> +					const struct bpf_prog *prog,
> +					struct bpf_insn_access_aux *info)
> +{
> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return bpf_base_func_proto(func_id, prog);
> +}
> +
> +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> +	.get_func_proto = bpf_thp_get_func_proto,
> +	.is_valid_access = bpf_thp_ops_is_valid_access,
> +};
> +
> +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +
> +	/* TODO: add support for multiple attaches */
> +	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +		&transparent_hugepage_flags))
> +		return -EOPNOTSUPP;

I think returning -EBUSY if the struct_ops is already attached is a 
better choice

> +	bpf_thp.allocator = ops->allocator;
> +	bpf_thp.reclaimer = ops->reclaimer;
> +	return 0;
> +}
> +
> +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> +{
> +	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> +	bpf_thp.allocator = NULL;
> +	bpf_thp.reclaimer = NULL;
> +}
> +
> +static int bpf_thp_check_member(const struct btf_type *t,
> +				const struct btf_member *member,
> +				const struct bpf_prog *prog)
> +{
> +	return 0;
> +}
> +

[...]

> +static int bpf_thp_init_member(const struct btf_type *t,
> +			       const struct btf_member *member,
> +			       void *kdata, const void *udata)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_init(struct btf *btf)
> +{
> +	return 0;
> +}
> +
> +static int allocator(unsigned long vm_flags, unsigned long tva_flags)
> +{
> +	return 0;
> +}
> +
> +static int reclaimer(bool vma_madvised)
> +{
> +	return 0;
> +}
> +
> +static struct bpf_thp_ops __bpf_thp_ops = {
> +	.allocator = allocator,
> +	.reclaimer = reclaimer,
> +};
> +
> +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> +	.verifier_ops = &thp_bpf_verifier_ops,
> +	.init = bpf_thp_init,
> +	.check_member = bpf_thp_check_member,

nit. check_member doesn't need to be defined if it does not do anything.

> +	.init_member = bpf_thp_init_member,
> +	.reg = bpf_thp_reg,
> +	.unreg = bpf_thp_unreg,
> +	.name = "bpf_thp_ops",
> +	.cfi_stubs = &__bpf_thp_ops,
> +	.owner = THIS_MODULE,
> +};
> +
> +static int __init bpf_thp_ops_init(void)
> +{
> +	int err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> +
> +	if (err)
> +		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> +	return err;
> +}
> +late_initcall(bpf_thp_ops_init);


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-17  8:52     ` David Hildenbrand
  2025-07-17  9:05       ` Lorenzo Stoakes
@ 2025-07-20  2:32       ` Yafang Shao
  2025-07-20 15:56         ` David Hildenbrand
  1 sibling, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-07-20  2:32 UTC (permalink / raw)
  To: David Hildenbrand, Alexei Starovoitov
  Cc: Matthew Wilcox, akpm, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Thu, Jul 17, 2025 at 4:52 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 17.07.25 05:09, Yafang Shao wrote:
> > On Wed, Jul 16, 2025 at 6:42 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 08.06.25 09:35, Yafang Shao wrote:
> >>
> >> Sorry for not replying earlier, I was caught up with all other stuff.
> >>
> >> I still consider this a very interesting approach, although I think we
> >> should think more about what a reasonable policy would look like
> >> medoium-term (in particular, multiple THP sizes, not always falling back
> >> to small pages if it means splitting excessively in the buddy etc.)
> >
> > I find it difficult to understand why we introduced the mTHP sysfs
> > knobs instead of implementing automatic THP size switching within the
> > kernel. I'm skeptical about its practical utility in real-world
> > workloads.
> >
> > In contrast, XFS large folio (AKA. File THP) can automatically select
> > orders between 0 and 9. Based on our verification, this feature has
> > proven genuinely useful for certain specific workloads—though it's not
> > yet perfect.
>
> I suggest you do some digging about the history of these toggles and the
> plans for the future (automatic), there has been plenty of talk about
> all that.
>
> [...]
>
> >>>
> >>> - THP allocator
> >>>
> >>>     int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> >>>
> >>>     The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> >>>     indicating whether THP allocation should be performed synchronously
> >>>     (current task) or asynchronously (khugepaged).
> >>>
> >>>     The decision is based on the current task context, VMA flags, and TVA
> >>>     flags.
> >>
> >> I think we should go one step further and actually get advises about the
> >> orders (THP sizes) to use. It might be helpful if the program would have
> >> access to system stats, to make an educated decision.
> >>
> >> Given page fault information and system information, the program could
> >> then decide which orders to try to allocate.
> >
> > Yes, that aligns with my thoughts as well. For instance, we could
> > automate the decision-making process based on factors like PSI, memory
> > fragmentation, and other metrics. However, this logic could be
> > implemented within BPF programs—all we’d need is to extend the feature
> > by introducing a few kfuncs (also known as BPF helpers).
>
> We discussed this yesterday at a THP upstream meeting, and what we
> should look into is:
>
> (1) Having a callback like
>
> unsigned int (*get_suggested_order)(.., bool in_pagefault);

This interface meets our needs precisely, enabling allocation orders
of either 0 or 9 as required by our workloads.

>
> Where we can provide some information about the fault (vma
> size/flags/anon_name), and whether we are in the page fault (or in
> khugepaged).
>
> Maybe we want a bitmap of orders to try (fallback), not sure yet.
>
> (2) Having some way to tag these callbacks as "this is absolutely
> unstable for now and can be changed as we please.".

BPF has already helped us complete this, so we don’t need to implement
this restriction.
Note that all BPF kfuncs (including struct_ops) are currently unstable
and may change in the future.

Alexei, could you confirm this understanding?

>
> One idea will be to use this mechanism as a way to easily prototype
> policies, and once we know that a policy works, start moving it into the
> core.
>
> In general, the core, without a BPF program, should be able to continue
> providing a sane default behavior.

makes sense.

>
> >
> >>
> >> That means, one would query during page faults and during khugepaged,
> >> which order one should try -- compared to our current approach of "start
> >> with the largest order that is enabled and fits".
> >>
> >>>
> >>> - THP reclaimer
> >>>
> >>>     int (*reclaimer)(bool vma_madvised);
> >>>
> >>>     The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> >>>     determining whether memory reclamation is handled by the current task or
> >>>     kswapd.
> >>
> >> Not sure about that, will have to look into the details.
> >
> > Some workloads allocate all their memory during initialization and do
> > not require THP at runtime. For such cases, aggressively attempting
> > THP allocation is beneficial. However, other workloads may dynamically
> > allocate THP during execution—if these are latency-sensitive, we must
> > avoid introducing long allocation delays.
> >
> > Given these differing requirements, the global
> > /sys/kernel/mm/transparent_hugepage/defrag setting is insufficient.
> > Instead, we should implement per-workload defrag policies to better
> > optimize performance based on individual application behavior.
>
> We'll be very careful about the callbacks we will offer. Maybe the
> get_suggested_order() callback could itself make a decision and not
> suggest a high order if allocation would require comapction.
>
> Initially, we should keep it simple and see what other callbacks to add
> / how to extend get_suggested_order(), to cover these cases.

Yes, we can proceed by adding a simple get_suggested_order() and
address any remaining details in follow-up work.

--
Regards

Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma()
  2025-07-17 14:48   ` Usama Arif
@ 2025-07-20  2:37     ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2025-07-20  2:37 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, bpf, linux-mm

On Thu, Jul 17, 2025 at 10:48 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 08/06/2025 08:35, Yafang Shao wrote:
> > The order has already been validated in hugepage_pmd_enabled(), so there's
> > no need to recheck it in thp_vma_allowable_orders().
> >
>
>
> The checks are not equivalent.
>
> hugepage_pmd_enabled just checks if the sysfs entries allow hugification.
> thp_vma_allowable_orders modifies the orders that can be used based on vm_flags,
> which is not done in hugepage_pmd_enabled.

Thank you for catching that. I overlooked the vm_flags check.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-17 16:35 ` Usama Arif
@ 2025-07-20  2:54   ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2025-07-20  2:54 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, bpf, linux-mm

On Fri, Jul 18, 2025 at 12:35 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 08/06/2025 08:35, Yafang Shao wrote:
> > Background
> > ----------
> >
> > We have consistently configured THP to "never" on our production servers
> > due to past incidents caused by its behavior:
> >
> > - Increased memory consumption
> >   THP significantly raises overall memory usage.
> >
> > - Latency spikes
> >   Random latency spikes occur due to more frequent memory compaction
> >   activity triggered by THP.
> >
> > - Lack of Fine-Grained Control
> >   THP tuning knobs are globally configured, making them unsuitable for
> >   containerized environments. When different workloads run on the same
> >   host, enabling THP globally (without per-workload control) can cause
> >   unpredictable behavior.
> >
> > Due to these issues, system administrators remain hesitant to switch to
> > "madvise" or "always" modes—unless finer-grained control over THP
> > behavior is implemented.
> >
>
> Would this solution work and be carried over in fork+exec? As that is something
> which is very common. How would that look like?

The current implementation doesn't handle the fork+exec case properly.
I've developed an updated version that addresses this issue, though I
haven't submitted it for review yet.

In the new version:
khugepaged_fork() will verify process eligibility before entering
__khugepaged_enter()
If a parent process loses THP allocation privileges:
1. its child processes will skip __khugepaged_enter()
2. it will be removed from the khugepaged mm_slot.
3. the MMF_VM_HUGEPAGE will be cleared

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 2/5] mm, thp: add bpf thp hook to determine thp allocator
  2025-07-17 15:30   ` Usama Arif
@ 2025-07-20  3:00     ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2025-07-20  3:00 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, bpf, linux-mm

On Thu, Jul 17, 2025 at 11:30 PM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 08/06/2025 08:35, Yafang Shao wrote:
> > A new hook bpf_thp_allocator() is added to determine if the THP is
> > allocated by khugepaged or by the current task.
>
> I would add in the summary why we need this. I am assuming I will find out
> when reviewing the next few patches, but would be good to know here.
>
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/huge_mm.h | 10 ++++++++++
> >  mm/khugepaged.c         |  2 ++
> >  2 files changed, 12 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 2f190c90192d..db2eadd3f65b 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -190,6 +190,14 @@ static inline bool hugepage_global_always(void)
> >                       (1<<TRANSPARENT_HUGEPAGE_FLAG);
> >  }
> >
> > +#define THP_ALLOC_KHUGEPAGED (1 << 1)
> > +#define THP_ALLOC_CURRENT (1 << 2)
> > +static inline int bpf_thp_allocator(unsigned long vm_flags,
> > +                                  unsigned long tva_flags)
> > +{
> > +     return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
>
> You dont use either vm_flags or tva_flags in this function?
> I am guessing you wanted to check if these bits are set here?
>
>
> But you dont seem to be setting these flags anywhere? I am guessing
> its in a future patch. If it is, I would move the setting of these bits
> here as its confusing to only see the check without knowing where its
>
> I feel this patch is broken and needs to be rewritten.

The `bpf_thp_allocator()` function serves as a placeholder that will
be implemented by a BPF program. The BPF implementation will use the
`@vm_flags` and `@tva_flags` parameters to determine whether a task
qualifies for THP allocation.

I see how this could be confusing.

> > +}
> > +
> >  static inline int highest_order(unsigned long orders)
> >  {
> >       return fls_long(orders) - 1;
> > @@ -290,6 +298,8 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >       if ((tva_flags & TVA_ENFORCE_SYSFS) && vma_is_anonymous(vma)) {
> >               unsigned long mask = READ_ONCE(huge_anon_orders_always);
> >
> > +             if (!(bpf_thp_allocator(vm_flags, tva_flags) & THP_ALLOC_CURRENT))
> > +                     return 0;
>
> I am assuming that this is the point to check for allocation, but thp_vma_allowable_orders
> is not just used for allocation, its used for in other places as well, like hugepage_vma_revalidate
> and swap.

Agreed, some adjustments are necessary.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 3/5] mm, thp: add bpf thp hook to determine thp reclaimer
  2025-07-17 16:06   ` Usama Arif
@ 2025-07-20  3:03     ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2025-07-20  3:03 UTC (permalink / raw)
  To: Usama Arif
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, gutierrez.asier, willy,
	ast, daniel, andrii, bpf, linux-mm

On Fri, Jul 18, 2025 at 12:06 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 08/06/2025 08:35, Yafang Shao wrote:
> > A new hook, bpf_thp_gfp_mask(), is introduced to determine whether memory
> > reclamation is being performed by the current task or by kswapd.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/huge_mm.h | 5 +++++
> >  mm/huge_memory.c        | 5 +++++
> >  2 files changed, 10 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index db2eadd3f65b..6a40ebf25f5c 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -198,6 +198,11 @@ static inline int bpf_thp_allocator(unsigned long vm_flags,
> >       return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> >  }
> >
> > +static inline gfp_t bpf_thp_gfp_mask(bool vma_madvised)
> > +{
> > +     return 0;
> > +}
> > +
> >  static inline int highest_order(unsigned long orders)
> >  {
> >       return fls_long(orders) - 1;
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d3e66136e41a..81c1711d13fa 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1280,6 +1280,11 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >  gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma)
> >  {
> >       const bool vma_madvised = vma && (vma->vm_flags & VM_HUGEPAGE);
> > +     gfp_t gfp_mask;
> > +
> > +     gfp_mask = bpf_thp_gfp_mask(vma_madvised);
>
>
> I am guessing bpf_thp_gfp_mask returns 0, as its something yet to be implemented,
> but I really dont understand what this patch is supposed to do.

This change only introduces a placeholder for the BPF program. It
might be cleaner to squash it into patch #4.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops
  2025-07-17 18:21   ` Amery Hung
@ 2025-07-20  3:07     ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2025-07-20  3:07 UTC (permalink / raw)
  To: Amery Hung
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, bpf, linux-mm

On Fri, Jul 18, 2025 at 2:21 AM Amery Hung <ameryhung@gmail.com> wrote:
>
>
>
> On 6/8/25 12:35 AM, Yafang Shao wrote:
> > A new bpf_thp struct ops is introduced to provide finer-grained control
> > over THP allocation policy. The struct ops includes two APIs for
> > determining the THP allocator and reclaimer behavior:
> >
> > - THP allocator
> >
> >    int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> >
> >    The BPF program returns either THP_ALLOC_CURRENT or THP_ALLOC_KHUGEPAGED,
> >    indicating whether THP allocation should be performed synchronously
> >    (current task) or asynchronously (khugepaged).
> >
> >    The decision is based on the current task context, VMA flags, and TVA
> >    flags.
> >
> > - THP reclaimer
> >
> >    int (*reclaimer)(bool vma_madvised);
> >
> >    The BPF program returns either RECLAIMER_CURRENT or RECLAIMER_KSWAPD,
> >    determining whether memory reclamation is handled by the current task or
> >    kswapd.
> >
> >    The decision depends on the current task and VMA flags.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >   include/linux/huge_mm.h |  13 +--
> >   mm/Makefile             |   3 +
> >   mm/bpf_thp.c            | 184 ++++++++++++++++++++++++++++++++++++++++
> >   3 files changed, 190 insertions(+), 10 deletions(-)
> >   create mode 100644 mm/bpf_thp.c
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 6a40ebf25f5c..0d02c9b56a85 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -54,6 +54,7 @@ enum transparent_hugepage_flag {
> >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> >   };
> >
> >   struct kobject;
> > @@ -192,16 +193,8 @@ static inline bool hugepage_global_always(void)
> >
> >   #define THP_ALLOC_KHUGEPAGED (1 << 1)
> >   #define THP_ALLOC_CURRENT (1 << 2)
> > -static inline int bpf_thp_allocator(unsigned long vm_flags,
> > -                                  unsigned long tva_flags)
> > -{
> > -     return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> > -}
> > -
> > -static inline gfp_t bpf_thp_gfp_mask(bool vma_madvised)
> > -{
> > -     return 0;
> > -}
> > +int bpf_thp_allocator(unsigned long vm_flags, unsigned long tva_flags);
> > +gfp_t bpf_thp_gfp_mask(bool vma_madvised);
> >
> >   static inline int highest_order(unsigned long orders)
> >   {
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 1a7a11d4933d..e5f41cf3fd61 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,9 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> >   obj-$(CONFIG_NUMA) += memory-tiers.o
> >   obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +ifdef CONFIG_BPF_SYSCALL
> > +obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += bpf_thp.o
> > +endif
> >   obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> >   obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> >   obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/bpf_thp.c b/mm/bpf_thp.c
> > new file mode 100644
> > index 000000000000..894d6cb93107
> > --- /dev/null
> > +++ b/mm/bpf_thp.c
> > @@ -0,0 +1,184 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +#define RECLAIMER_CURRENT (1 << 1)
> > +#define RECLAIMER_KSWAPD (1 << 2)
> > +#define RECLAIMER_BOTH (RECLAIMER_CURRENT | RECLAIMER_KSWAPD)
> > +
> > +struct bpf_thp_ops {
> > +     /**
> > +      * @allocator: Specifies whether the THP allocation is performed
> > +      * by the current task or by khugepaged.
> > +      * @vm_flags: Flags for the VMA in the current allocation context
> > +      * @tva_flags: Flags for the TVA in the current allocation context
> > +      *
> > +      * Rerurn:
> > +      * - THP_ALLOC_CURRENT: THP was allocated synchronously by the calling
> > +      *   task's context.
> > +      * - THP_ALLOC_KHUGEPAGED: THP was allocated asynchronously by the
> > +      *   khugepaged kernel thread.
> > +      * - 0: THP allocation is disallowed in the current context.
> > +      */
> > +     int (*allocator)(unsigned long vm_flags, unsigned long tva_flags);
> > +     /**
> > +      * @reclaimer: Specifies the entity performing page reclaim:
> > +      *             - current task context
> > +      *             - kswapd
> > +      *             - none (no reclaim)
> > +      * @vma_madvised: MADV flags for this VMA (e.g., MADV_HUGEPAGE, MADV_NOHUGEPAGE)
> > +      *
> > +      * Return:
> > +      * - RECLAIMER_CURRENT: Direct reclaim by the current task if THP
> > +      *   allocation fails.
> > +      * - RECLAIMER_KSWAPD: Wake kswapd to reclaim memory if THP allocation fails.
> > +      * - RECLAIMER_ALL: Both current and kswapd will perform the reclaim
> > +      * - 0: No reclaim will be attempted.
> > +      */
> > +     int (*reclaimer)(bool vma_madvised);
> > +};
> > +
> > +static struct bpf_thp_ops bpf_thp;
> > +
> > +int bpf_thp_allocator(unsigned long vm_flags, unsigned long tva_flags)
> > +{
> > +     int allocator;
> > +
> > +     /* No BPF program is attached */
> > +     if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
> > +             return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> > +
> > +     if (current_is_khugepaged())
> > +             return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> > +     if (!bpf_thp.allocator)
> > +             return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> > +
> > +     allocator = bpf_thp.allocator(vm_flags, tva_flags);
> > +     if (!allocator)
> > +             return 0;
>
> The check seems redundant. Is it?

Right, thanks for pointing it out.

>
> > +     /* invalid return value */
> > +     if (allocator & ~(THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT))
> > +             return THP_ALLOC_KHUGEPAGED | THP_ALLOC_CURRENT;
> > +     return allocator;
> > +}
> > +
> > +gfp_t bpf_thp_gfp_mask(bool vma_madvised)
> > +{
> > +     int reclaimer;
> > +
> > +     if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_BPF_ATTACHED)))
> > +             return 0;
> > +
> > +     if (!bpf_thp.reclaimer)
> > +             return 0;
> > +
> > +     reclaimer = bpf_thp.reclaimer(vma_madvised);
> > +     switch (reclaimer) {
> > +     case RECLAIMER_CURRENT:
> > +             return GFP_TRANSHUGE | __GFP_NORETRY;
> > +     case RECLAIMER_KSWAPD:
> > +             return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> > +     case RECLAIMER_BOTH:
> > +             return GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM | __GFP_NORETRY;
> > +     default:
> > +             return 0;
> > +     }
> > +}
> > +
> > +static bool bpf_thp_ops_is_valid_access(int off, int size,
> > +                                     enum bpf_access_type type,
> > +                                     const struct bpf_prog *prog,
> > +                                     struct bpf_insn_access_aux *info)
> > +{
> > +     return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> > +}
> > +
> > +static const struct bpf_func_proto *
> > +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> > +{
> > +     return bpf_base_func_proto(func_id, prog);
> > +}
> > +
> > +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> > +     .get_func_proto = bpf_thp_get_func_proto,
> > +     .is_valid_access = bpf_thp_ops_is_valid_access,
> > +};
> > +
> > +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> > +{
> > +     struct bpf_thp_ops *ops = kdata;
> > +
> > +     /* TODO: add support for multiple attaches */
> > +     if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +             &transparent_hugepage_flags))
> > +             return -EOPNOTSUPP;
>
> I think returning -EBUSY if the struct_ops is already attached is a
> better choice

Makes sense. Thanks for the suggestion.

>
> > +     bpf_thp.allocator = ops->allocator;
> > +     bpf_thp.reclaimer = ops->reclaimer;
> > +     return 0;
> > +}
> > +
> > +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> > +{
> > +     clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> > +     bpf_thp.allocator = NULL;
> > +     bpf_thp.reclaimer = NULL;
> > +}
> > +
> > +static int bpf_thp_check_member(const struct btf_type *t,
> > +                             const struct btf_member *member,
> > +                             const struct bpf_prog *prog)
> > +{
> > +     return 0;
> > +}
> > +
>
> [...]
>
> > +static int bpf_thp_init_member(const struct btf_type *t,
> > +                            const struct btf_member *member,
> > +                            void *kdata, const void *udata)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int bpf_thp_init(struct btf *btf)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int allocator(unsigned long vm_flags, unsigned long tva_flags)
> > +{
> > +     return 0;
> > +}
> > +
> > +static int reclaimer(bool vma_madvised)
> > +{
> > +     return 0;
> > +}
> > +
> > +static struct bpf_thp_ops __bpf_thp_ops = {
> > +     .allocator = allocator,
> > +     .reclaimer = reclaimer,
> > +};
> > +
> > +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> > +     .verifier_ops = &thp_bpf_verifier_ops,
> > +     .init = bpf_thp_init,
> > +     .check_member = bpf_thp_check_member,
>
> nit. check_member doesn't need to be defined if it does not do anything.

I will remove it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-20  2:32       ` Yafang Shao
@ 2025-07-20 15:56         ` David Hildenbrand
  2025-07-22  2:40           ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: David Hildenbrand @ 2025-07-20 15:56 UTC (permalink / raw)
  To: Yafang Shao, Alexei Starovoitov
  Cc: Matthew Wilcox, akpm, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

>>
>> We discussed this yesterday at a THP upstream meeting, and what we
>> should look into is:
>>
>> (1) Having a callback like
>>
>> unsigned int (*get_suggested_order)(.., bool in_pagefault);
> 
> This interface meets our needs precisely, enabling allocation orders
> of either 0 or 9 as required by our workloads.
> 
>>
>> Where we can provide some information about the fault (vma
>> size/flags/anon_name), and whether we are in the page fault (or in
>> khugepaged).
>>
>> Maybe we want a bitmap of orders to try (fallback), not sure yet.
>>
>> (2) Having some way to tag these callbacks as "this is absolutely
>> unstable for now and can be changed as we please.".
> 
> BPF has already helped us complete this, so we don’t need to implement
> this restriction.
> Note that all BPF kfuncs (including struct_ops) are currently unstable
> and may change in the future.
 > > Alexei, could you confirm this understanding?

Every MM person I talked to about this was like "as soon as it's 
actively used out there (e.g., a distro supports it), there is no way 
you can easily change these callbacks ever again - it will just silently 
become stable."

That is actually the biggest concern from the MM side: being stuck with 
an interface that was promised to be "unstable" but suddenly it's 
not-so-unstable anymore, and we have to support something that is very 
likely to be changed in the future.

Which guarantees do we have in the regard?

How can we make it clear to anybody using this specific interface that 
"if you depend on this being stable, you should learn how to read and 
you are to blame, not the MM people" ?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-20 15:56         ` David Hildenbrand
@ 2025-07-22  2:40           ` Yafang Shao
  2025-07-22  7:28             ` David Hildenbrand
  0 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-07-22  2:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexei Starovoitov, Matthew Wilcox, akpm, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hannes, usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf,
	linux-mm

On Sun, Jul 20, 2025 at 11:56 PM David Hildenbrand <david@redhat.com> wrote:
>
> >>
> >> We discussed this yesterday at a THP upstream meeting, and what we
> >> should look into is:
> >>
> >> (1) Having a callback like
> >>
> >> unsigned int (*get_suggested_order)(.., bool in_pagefault);
> >
> > This interface meets our needs precisely, enabling allocation orders
> > of either 0 or 9 as required by our workloads.
> >
> >>
> >> Where we can provide some information about the fault (vma
> >> size/flags/anon_name), and whether we are in the page fault (or in
> >> khugepaged).
> >>
> >> Maybe we want a bitmap of orders to try (fallback), not sure yet.
> >>
> >> (2) Having some way to tag these callbacks as "this is absolutely
> >> unstable for now and can be changed as we please.".
> >
> > BPF has already helped us complete this, so we don’t need to implement
> > this restriction.
> > Note that all BPF kfuncs (including struct_ops) are currently unstable
> > and may change in the future.
>  > > Alexei, could you confirm this understanding?
>
> Every MM person I talked to about this was like "as soon as it's
> actively used out there (e.g., a distro supports it), there is no way
> you can easily change these callbacks ever again - it will just silently
> become stable."
>
> That is actually the biggest concern from the MM side: being stuck with
> an interface that was promised to be "unstable" but suddenly it's
> not-so-unstable anymore, and we have to support something that is very
> likely to be changed in the future.
>
> Which guarantees do we have in the regard?
>
> How can we make it clear to anybody using this specific interface that
> "if you depend on this being stable, you should learn how to read and
> you are to blame, not the MM people" ?

As explained in the kernel document [0]:

kfuncs provide a kernel <-> kernel API, and thus are not bound by any
of the strict stability restrictions associated with kernel <-> user
UAPIs. This means they can be thought of as similar to
EXPORT_SYMBOL_GPL, and can therefore be modified or removed by a
maintainer of the subsystem they’re defined in when it’s deemed
necessary.

[0] https://docs.kernel.org/bpf/kfuncs.html#bpf-kfunc-lifecycle-expectations

That said, users of BPF kfuncs should treat them as inherently
unstable and take responsibility for verifying their compatibility
when switching kernel versions. However, this does not imply that BPF
kfuncs can be modified arbitrarily.

For widely adopted kfuncs that deliver substantial value, changes
should be made cautiously—preferably through backward-compatible
extensions to ensure continued functionality across new kernel
versions. Removal should only be considered in exceptional cases, such
as:
- Severe, unfixable issues within the kernel
- Maintenance burdens that block new features or critical improvements.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22  2:40           ` Yafang Shao
@ 2025-07-22  7:28             ` David Hildenbrand
  2025-07-22 10:09               ` Lorenzo Stoakes
  2025-07-22 11:46               ` Yafang Shao
  0 siblings, 2 replies; 33+ messages in thread
From: David Hildenbrand @ 2025-07-22  7:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Matthew Wilcox, akpm, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hannes, usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf,
	linux-mm

On 22.07.25 04:40, Yafang Shao wrote:
> On Sun, Jul 20, 2025 at 11:56 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>>
>>>> We discussed this yesterday at a THP upstream meeting, and what we
>>>> should look into is:
>>>>
>>>> (1) Having a callback like
>>>>
>>>> unsigned int (*get_suggested_order)(.., bool in_pagefault);
>>>
>>> This interface meets our needs precisely, enabling allocation orders
>>> of either 0 or 9 as required by our workloads.
>>>
>>>>
>>>> Where we can provide some information about the fault (vma
>>>> size/flags/anon_name), and whether we are in the page fault (or in
>>>> khugepaged).
>>>>
>>>> Maybe we want a bitmap of orders to try (fallback), not sure yet.
>>>>
>>>> (2) Having some way to tag these callbacks as "this is absolutely
>>>> unstable for now and can be changed as we please.".
>>>
>>> BPF has already helped us complete this, so we don’t need to implement
>>> this restriction.
>>> Note that all BPF kfuncs (including struct_ops) are currently unstable
>>> and may change in the future.
>>   > > Alexei, could you confirm this understanding?
>>
>> Every MM person I talked to about this was like "as soon as it's
>> actively used out there (e.g., a distro supports it), there is no way
>> you can easily change these callbacks ever again - it will just silently
>> become stable."
>>
>> That is actually the biggest concern from the MM side: being stuck with
>> an interface that was promised to be "unstable" but suddenly it's
>> not-so-unstable anymore, and we have to support something that is very
>> likely to be changed in the future.
>>
>> Which guarantees do we have in the regard?
>>
>> How can we make it clear to anybody using this specific interface that
>> "if you depend on this being stable, you should learn how to read and
>> you are to blame, not the MM people" ?
> 
> As explained in the kernel document [0]:
> 
> kfuncs provide a kernel <-> kernel API, and thus are not bound by any
> of the strict stability restrictions associated with kernel <-> user
> UAPIs. This means they can be thought of as similar to
> EXPORT_SYMBOL_GPL, and can therefore be modified or removed by a
> maintainer of the subsystem they’re defined in when it’s deemed
> necessary.
> 
> [0] https://docs.kernel.org/bpf/kfuncs.html#bpf-kfunc-lifecycle-expectations
> 
> That said, users of BPF kfuncs should treat them as inherently
> unstable and take responsibility for verifying their compatibility
> when switching kernel versions. However, this does not imply that BPF
> kfuncs can be modified arbitrarily.
> 
> For widely adopted kfuncs that deliver substantial value, changes
> should be made cautiously—preferably through backward-compatible
> extensions to ensure continued functionality across new kernel
> versions. Removal should only be considered in exceptional cases, such
> as:
> - Severe, unfixable issues within the kernel
> - Maintenance burdens that block new features or critical improvements.

And that is exactly what we are worried about.

You don't know beforehand whether something will be "widely adopted".

Even if there is the "A kfunc will never have any hard stability 
guarantees." in there.

The concerning bit is:

"kfuncs that are widely used or have been in the kernel for a long time 
will be more difficult to justify being changed or removed by a 
maintainer. "

Just no. Not going to happen for the kfuncs we know upfront (like here) 
will stand in our way in the future at some point and *will* be changed 
one way or another.


So for these kfuncs I want a clear way of expressing "whatever the 
kfuncs doc says, this here is completely unstable even if widely used"

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22  7:28             ` David Hildenbrand
@ 2025-07-22 10:09               ` Lorenzo Stoakes
  2025-07-22 11:56                 ` Yafang Shao
  2025-07-22 11:46               ` Yafang Shao
  1 sibling, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-07-22 10:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Yafang Shao, Alexei Starovoitov, Matthew Wilcox, akpm, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Tue, Jul 22, 2025 at 09:28:02AM +0200, David Hildenbrand wrote:
> On 22.07.25 04:40, Yafang Shao wrote:
> > On Sun, Jul 20, 2025 at 11:56 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > > >
> > > > > We discussed this yesterday at a THP upstream meeting, and what we
> > > > > should look into is:
> > > > >
> > > > > (1) Having a callback like
> > > > >
> > > > > unsigned int (*get_suggested_order)(.., bool in_pagefault);
> > > >
> > > > This interface meets our needs precisely, enabling allocation orders
> > > > of either 0 or 9 as required by our workloads.

That's great to hear, and to be clear my views align with David on this - I
feel like having a _carefully chosen_ BPF interface could be valuable here,
especially in the short to medium term where it will allow us to more
rapidly iterate on an automated [m]THP mechanism.

I think one key question here is - do we want to retain a _permanent_ BPF
hook here?

In any cae, for the first experiments with this we absolutely _must_ be
able to express that this is going away, NO, not based on whether it's
widely used, it IS going away.

> > > >
> > > > >
> > > > > Where we can provide some information about the fault (vma
> > > > > size/flags/anon_name), and whether we are in the page fault (or in
> > > > > khugepaged).
> > > > >
> > > > > Maybe we want a bitmap of orders to try (fallback), not sure yet.
> > > > >
> > > > > (2) Having some way to tag these callbacks as "this is absolutely
> > > > > unstable for now and can be changed as we please.".
> > > >
> > > > BPF has already helped us complete this, so we don’t need to implement
> > > > this restriction.
> > > > Note that all BPF kfuncs (including struct_ops) are currently unstable
> > > > and may change in the future.
> > >   > > Alexei, could you confirm this understanding?
> > >
> > > Every MM person I talked to about this was like "as soon as it's
> > > actively used out there (e.g., a distro supports it), there is no way
> > > you can easily change these callbacks ever again - it will just silently
> > > become stable."
> > >
> > > That is actually the biggest concern from the MM side: being stuck with
> > > an interface that was promised to be "unstable" but suddenly it's
> > > not-so-unstable anymore, and we have to support something that is very
> > > likely to be changed in the future.
> > >
> > > Which guarantees do we have in the regard?
> > >
> > > How can we make it clear to anybody using this specific interface that
> > > "if you depend on this being stable, you should learn how to read and
> > > you are to blame, not the MM people" ?
> >
> > As explained in the kernel document [0]:
> >
> > kfuncs provide a kernel <-> kernel API, and thus are not bound by any
> > of the strict stability restrictions associated with kernel <-> user
> > UAPIs. This means they can be thought of as similar to
> > EXPORT_SYMBOL_GPL, and can therefore be modified or removed by a
> > maintainer of the subsystem they’re defined in when it’s deemed
> > necessary.

I find this documentation super contradictory. I'm sorry but you can't
have:

"...can therefore be modified or removed by a maintainer of the subsystem
 they’re defined in when it’s deemed necessary."

And:

"kfuncs that are widely used or have been in the kernel for a long time
will be more difficult to justify being changed or removed by a
maintainer."

At the same time. Let alone:

"A kfunc will never have any hard stability guarantees. BPF APIs cannot and
will not ever hard-block a change in the kernel purely for stability
reasons"

Make your mind up!!

I mean the EXPORT_SYMBOL_GPL() example isn't accurate AT ALL - we can
_absolutely_ change or remove those _at will_ as we don't care about
external modules.

Really this seems to be saying, in not so many words, that this is
basically a kAPI and you can't change it.

So this strictly violates what we need here.


> >
> > [0] https://docs.kernel.org/bpf/kfuncs.html#bpf-kfunc-lifecycle-expectations
> >
> > That said, users of BPF kfuncs should treat them as inherently
> > unstable and take responsibility for verifying their compatibility
> > when switching kernel versions. However, this does not imply that BPF
> > kfuncs can be modified arbitrarily.
> >
> > For widely adopted kfuncs that deliver substantial value, changes
> > should be made cautiously—preferably through backward-compatible
> > extensions to ensure continued functionality across new kernel
> > versions. Removal should only be considered in exceptional cases, such
> > as:
> > - Severe, unfixable issues within the kernel
> > - Maintenance burdens that block new features or critical improvements.
>
> And that is exactly what we are worried about.
>
> You don't know beforehand whether something will be "widely adopted".
>
> Even if there is the "A kfunc will never have any hard stability
> guarantees." in there.
>
> The concerning bit is:
>
> "kfuncs that are widely used or have been in the kernel for a long time will
> be more difficult to justify being changed or removed by a maintainer. "
>
> Just no. Not going to happen for the kfuncs we know upfront (like here) will
> stand in our way in the future at some point and *will* be changed one way
> or another.

Yes, and the EXPORT*() example is plain wrong in that document.

>
>
> So for these kfuncs I want a clear way of expressing "whatever the kfuncs
> doc says, this here is completely unstable even if widely used"

I wonder if we can use a CONFIG_xxx and put this behind that, which
specifically says 'WE WILL REMOVE THIS'
CONFIG_EXPERIMENTAL_DO_NOT_USE_THP_THINGY :P

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22  7:28             ` David Hildenbrand
  2025-07-22 10:09               ` Lorenzo Stoakes
@ 2025-07-22 11:46               ` Yafang Shao
  2025-07-22 11:54                 ` Lorenzo Stoakes
  1 sibling, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-07-22 11:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Alexei Starovoitov, Matthew Wilcox, akpm, ziy, baolin.wang,
	lorenzo.stoakes, Liam.Howlett, npache, ryan.roberts, dev.jain,
	hannes, usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf,
	linux-mm

On Tue, Jul 22, 2025 at 3:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.07.25 04:40, Yafang Shao wrote:
> > On Sun, Jul 20, 2025 at 11:56 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >>>>
> >>>> We discussed this yesterday at a THP upstream meeting, and what we
> >>>> should look into is:
> >>>>
> >>>> (1) Having a callback like
> >>>>
> >>>> unsigned int (*get_suggested_order)(.., bool in_pagefault);
> >>>
> >>> This interface meets our needs precisely, enabling allocation orders
> >>> of either 0 or 9 as required by our workloads.
> >>>
> >>>>
> >>>> Where we can provide some information about the fault (vma
> >>>> size/flags/anon_name), and whether we are in the page fault (or in
> >>>> khugepaged).
> >>>>
> >>>> Maybe we want a bitmap of orders to try (fallback), not sure yet.
> >>>>
> >>>> (2) Having some way to tag these callbacks as "this is absolutely
> >>>> unstable for now and can be changed as we please.".
> >>>
> >>> BPF has already helped us complete this, so we don’t need to implement
> >>> this restriction.
> >>> Note that all BPF kfuncs (including struct_ops) are currently unstable
> >>> and may change in the future.
> >>   > > Alexei, could you confirm this understanding?
> >>
> >> Every MM person I talked to about this was like "as soon as it's
> >> actively used out there (e.g., a distro supports it), there is no way
> >> you can easily change these callbacks ever again - it will just silently
> >> become stable."
> >>
> >> That is actually the biggest concern from the MM side: being stuck with
> >> an interface that was promised to be "unstable" but suddenly it's
> >> not-so-unstable anymore, and we have to support something that is very
> >> likely to be changed in the future.
> >>
> >> Which guarantees do we have in the regard?
> >>
> >> How can we make it clear to anybody using this specific interface that
> >> "if you depend on this being stable, you should learn how to read and
> >> you are to blame, not the MM people" ?
> >
> > As explained in the kernel document [0]:
> >
> > kfuncs provide a kernel <-> kernel API, and thus are not bound by any
> > of the strict stability restrictions associated with kernel <-> user
> > UAPIs. This means they can be thought of as similar to
> > EXPORT_SYMBOL_GPL, and can therefore be modified or removed by a
> > maintainer of the subsystem they’re defined in when it’s deemed
> > necessary.
> >
> > [0] https://docs.kernel.org/bpf/kfuncs.html#bpf-kfunc-lifecycle-expectations
> >
> > That said, users of BPF kfuncs should treat them as inherently
> > unstable and take responsibility for verifying their compatibility
> > when switching kernel versions. However, this does not imply that BPF
> > kfuncs can be modified arbitrarily.
> >
> > For widely adopted kfuncs that deliver substantial value, changes
> > should be made cautiously—preferably through backward-compatible
> > extensions to ensure continued functionality across new kernel
> > versions. Removal should only be considered in exceptional cases, such
> > as:
> > - Severe, unfixable issues within the kernel
> > - Maintenance burdens that block new features or critical improvements.
>
> And that is exactly what we are worried about.
>
> You don't know beforehand whether something will be "widely adopted".
>
> Even if there is the "A kfunc will never have any hard stability
> guarantees." in there.
>
> The concerning bit is:
>
> "kfuncs that are widely used or have been in the kernel for a long time
> will be more difficult to justify being changed or removed by a
> maintainer. "
>
> Just no. Not going to happen for the kfuncs we know upfront (like here)
> will stand in our way in the future at some point and *will* be changed
> one way or another.
>
>
> So for these kfuncs I want a clear way of expressing "whatever the
> kfuncs doc says, this here is completely unstable even if widely used"

This statement does not conflict with the BPF kfuncs documentation, as
it explicitly states:
"This means they can be thought of as similar to EXPORT_SYMBOL_GPL,
and can therefore be modified or removed by a maintainer of the
subsystem they're defined in when deemed necessary."

There is no question that subsystem maintainers have the authority to
remove kfuncs. However, the reason I raised the issue of removing
widely used kfuncs is to highlight the recommended practice:
- First mark the kfunc as KF_DEPRECATED.
- Remove it in the next development cycle.

While this is not a strict requirement—maintainers can remove kfuncs
immediately without deprecation—following this guideline helps avoid
unnecessary disruptions for users.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22 11:46               ` Yafang Shao
@ 2025-07-22 11:54                 ` Lorenzo Stoakes
  2025-07-22 12:02                   ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-07-22 11:54 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Hildenbrand, Alexei Starovoitov, Matthew Wilcox, akpm, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Tue, Jul 22, 2025 at 07:46:57PM +0800, Yafang Shao wrote:
> > So for these kfuncs I want a clear way of expressing "whatever the
> > kfuncs doc says, this here is completely unstable even if widely used"
>
> This statement does not conflict with the BPF kfuncs documentation, as
> it explicitly states:
> "This means they can be thought of as similar to EXPORT_SYMBOL_GPL,
> and can therefore be modified or removed by a maintainer of the
> subsystem they're defined in when deemed necessary."

Except that's not how EXPORT_SYMBOL_GPL() works at all, we can remove at
will and are only required to update in-kernel users.

So that comparison is simply bogus.

>
> There is no question that subsystem maintainers have the authority to
> remove kfuncs. However, the reason I raised the issue of removing

Except the documentation that seems to very strongly suggest otherwise?

> widely used kfuncs is to highlight the recommended practice:
> - First mark the kfunc as KF_DEPRECATED.
> - Remove it in the next development cycle.

This seems reasonable, but I'm not in the slightest confident in us just
relying on this.

>
> While this is not a strict requirement—maintainers can remove kfuncs
> immediately without deprecation—following this guideline helps avoid
> unnecessary disruptions for users.

The documentation doesn't state this, you are surely just inferring it?

>
> --
> Regards
> Yafang

Overall I think we need a different mechanism in addition to this, such as
a very clearly described CONFIG_ option that makes it ABUNDANTLY clear that
the config and thus the related BPF hook may be removed at any time.

Ideally with 'experimental' or such in the name, or perhaps even tainting
the kernel.

We definitely need something better than what this documentation is saying,
sorry. I am not in any way confident in relying no what this document
states.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22 10:09               ` Lorenzo Stoakes
@ 2025-07-22 11:56                 ` Yafang Shao
  2025-07-22 12:04                   ` Lorenzo Stoakes
  0 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-07-22 11:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Alexei Starovoitov, Matthew Wilcox, akpm, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Tue, Jul 22, 2025 at 6:09 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Jul 22, 2025 at 09:28:02AM +0200, David Hildenbrand wrote:
> > On 22.07.25 04:40, Yafang Shao wrote:
> > > On Sun, Jul 20, 2025 at 11:56 PM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > > >
> > > > > > We discussed this yesterday at a THP upstream meeting, and what we
> > > > > > should look into is:
> > > > > >
> > > > > > (1) Having a callback like
> > > > > >
> > > > > > unsigned int (*get_suggested_order)(.., bool in_pagefault);
> > > > >
> > > > > This interface meets our needs precisely, enabling allocation orders
> > > > > of either 0 or 9 as required by our workloads.
>
> That's great to hear, and to be clear my views align with David on this - I
> feel like having a _carefully chosen_ BPF interface could be valuable here,
> especially in the short to medium term where it will allow us to more
> rapidly iterate on an automated [m]THP mechanism.
>
> I think one key question here is - do we want to retain a _permanent_ BPF
> hook here?
>
> In any cae, for the first experiments with this we absolutely _must_ be
> able to express that this is going away, NO, not based on whether it's
> widely used, it IS going away.

If this BPF kfunc provides clear user value with minimal maintenance
overhead, what would be the rationale for removing it? That said, if
you and David both agree it should be deprecated, I won't object -
though I'd suggest following the standard deprecation process.

>
> > > > >
> > > > > >
> > > > > > Where we can provide some information about the fault (vma
> > > > > > size/flags/anon_name), and whether we are in the page fault (or in
> > > > > > khugepaged).
> > > > > >
> > > > > > Maybe we want a bitmap of orders to try (fallback), not sure yet.
> > > > > >
> > > > > > (2) Having some way to tag these callbacks as "this is absolutely
> > > > > > unstable for now and can be changed as we please.".
> > > > >
> > > > > BPF has already helped us complete this, so we don’t need to implement
> > > > > this restriction.
> > > > > Note that all BPF kfuncs (including struct_ops) are currently unstable
> > > > > and may change in the future.
> > > >   > > Alexei, could you confirm this understanding?
> > > >
> > > > Every MM person I talked to about this was like "as soon as it's
> > > > actively used out there (e.g., a distro supports it), there is no way
> > > > you can easily change these callbacks ever again - it will just silently
> > > > become stable."
> > > >
> > > > That is actually the biggest concern from the MM side: being stuck with
> > > > an interface that was promised to be "unstable" but suddenly it's
> > > > not-so-unstable anymore, and we have to support something that is very
> > > > likely to be changed in the future.
> > > >
> > > > Which guarantees do we have in the regard?
> > > >
> > > > How can we make it clear to anybody using this specific interface that
> > > > "if you depend on this being stable, you should learn how to read and
> > > > you are to blame, not the MM people" ?
> > >
> > > As explained in the kernel document [0]:
> > >
> > > kfuncs provide a kernel <-> kernel API, and thus are not bound by any
> > > of the strict stability restrictions associated with kernel <-> user
> > > UAPIs. This means they can be thought of as similar to
> > > EXPORT_SYMBOL_GPL, and can therefore be modified or removed by a
> > > maintainer of the subsystem they’re defined in when it’s deemed
> > > necessary.
>
> I find this documentation super contradictory. I'm sorry but you can't
> have:
>
> "...can therefore be modified or removed by a maintainer of the subsystem
>  they’re defined in when it’s deemed necessary."
>
> And:
>
> "kfuncs that are widely used or have been in the kernel for a long time
> will be more difficult to justify being changed or removed by a
> maintainer."
>
> At the same time. Let alone:
>
> "A kfunc will never have any hard stability guarantees. BPF APIs cannot and
> will not ever hard-block a change in the kernel purely for stability
> reasons"
>
> Make your mind up!!
>
> I mean the EXPORT_SYMBOL_GPL() example isn't accurate AT ALL - we can
> _absolutely_ change or remove those _at will_ as we don't care about
> external modules.
>
> Really this seems to be saying, in not so many words, that this is
> basically a kAPI and you can't change it.
>
> So this strictly violates what we need here.

The maintainers have the authority to make the final determination ;-)
>
>
> > >
> > > [0] https://docs.kernel.org/bpf/kfuncs.html#bpf-kfunc-lifecycle-expectations
> > >
> > > That said, users of BPF kfuncs should treat them as inherently
> > > unstable and take responsibility for verifying their compatibility
> > > when switching kernel versions. However, this does not imply that BPF
> > > kfuncs can be modified arbitrarily.
> > >
> > > For widely adopted kfuncs that deliver substantial value, changes
> > > should be made cautiously—preferably through backward-compatible
> > > extensions to ensure continued functionality across new kernel
> > > versions. Removal should only be considered in exceptional cases, such
> > > as:
> > > - Severe, unfixable issues within the kernel
> > > - Maintenance burdens that block new features or critical improvements.
> >
> > And that is exactly what we are worried about.
> >
> > You don't know beforehand whether something will be "widely adopted".
> >
> > Even if there is the "A kfunc will never have any hard stability
> > guarantees." in there.
> >
> > The concerning bit is:
> >
> > "kfuncs that are widely used or have been in the kernel for a long time will
> > be more difficult to justify being changed or removed by a maintainer. "
> >
> > Just no. Not going to happen for the kfuncs we know upfront (like here) will
> > stand in our way in the future at some point and *will* be changed one way
> > or another.
>
> Yes, and the EXPORT*() example is plain wrong in that document.
>
> >
> >
> > So for these kfuncs I want a clear way of expressing "whatever the kfuncs
> > doc says, this here is completely unstable even if widely used"
>
> I wonder if we can use a CONFIG_xxx and put this behind that, which
> specifically says 'WE WILL REMOVE THIS'
> CONFIG_EXPERIMENTAL_DO_NOT_USE_THP_THINGY :P

That's a reasonable suggestion. We could implement this function under
CONFIG_EXPERIMENTAL to mark it as experimental infrastructure.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22 11:54                 ` Lorenzo Stoakes
@ 2025-07-22 12:02                   ` Yafang Shao
  2025-07-22 12:08                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 33+ messages in thread
From: Yafang Shao @ 2025-07-22 12:02 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Alexei Starovoitov, Matthew Wilcox, akpm, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Tue, Jul 22, 2025 at 7:55 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Jul 22, 2025 at 07:46:57PM +0800, Yafang Shao wrote:
> > > So for these kfuncs I want a clear way of expressing "whatever the
> > > kfuncs doc says, this here is completely unstable even if widely used"
> >
> > This statement does not conflict with the BPF kfuncs documentation, as
> > it explicitly states:
> > "This means they can be thought of as similar to EXPORT_SYMBOL_GPL,
> > and can therefore be modified or removed by a maintainer of the
> > subsystem they're defined in when deemed necessary."
>
> Except that's not how EXPORT_SYMBOL_GPL() works at all, we can remove at
> will and are only required to update in-kernel users.
>
> So that comparison is simply bogus.
>
> >
> > There is no question that subsystem maintainers have the authority to
> > remove kfuncs. However, the reason I raised the issue of removing
>
> Except the documentation that seems to very strongly suggest otherwise?
>
> > widely used kfuncs is to highlight the recommended practice:
> > - First mark the kfunc as KF_DEPRECATED.
> > - Remove it in the next development cycle.
>
> This seems reasonable, but I'm not in the slightest confident in us just
> relying on this.
>
> >
> > While this is not a strict requirement—maintainers can remove kfuncs
> > immediately without deprecation—following this guideline helps avoid
> > unnecessary disruptions for users.
>
> The documentation doesn't state this, you are surely just inferring it?

As noted in the other thread, the maintainers ultimately have final
say on this matter. I'm simply sharing my perspective here.

>
> >
> > --
> > Regards
> > Yafang
>
> Overall I think we need a different mechanism in addition to this, such as
> a very clearly described CONFIG_ option that makes it ABUNDANTLY clear that
> the config and thus the related BPF hook may be removed at any time.
>
> Ideally with 'experimental' or such in the name, or perhaps even tainting
> the kernel.

Agreed.

>
> We definitely need something better than what this documentation is saying,
> sorry. I am not in any way confident in relying no what this document
> states.

The documentation has always been difficult to understand ;-)

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22 11:56                 ` Yafang Shao
@ 2025-07-22 12:04                   ` Lorenzo Stoakes
  2025-07-22 12:16                     ` Yafang Shao
  0 siblings, 1 reply; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-07-22 12:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Hildenbrand, Alexei Starovoitov, Matthew Wilcox, akpm, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Tue, Jul 22, 2025 at 07:56:21PM +0800, Yafang Shao wrote:
> On Tue, Jul 22, 2025 at 6:09 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Jul 22, 2025 at 09:28:02AM +0200, David Hildenbrand wrote:
> > > On 22.07.25 04:40, Yafang Shao wrote:
> > > > On Sun, Jul 20, 2025 at 11:56 PM David Hildenbrand <david@redhat.com> wrote:
> > > > >
> > > > > > >
> > > > > > > We discussed this yesterday at a THP upstream meeting, and what we
> > > > > > > should look into is:
> > > > > > >
> > > > > > > (1) Having a callback like
> > > > > > >
> > > > > > > unsigned int (*get_suggested_order)(.., bool in_pagefault);
> > > > > >
> > > > > > This interface meets our needs precisely, enabling allocation orders
> > > > > > of either 0 or 9 as required by our workloads.
> >
> > That's great to hear, and to be clear my views align with David on this - I
> > feel like having a _carefully chosen_ BPF interface could be valuable here,
> > especially in the short to medium term where it will allow us to more
> > rapidly iterate on an automated [m]THP mechanism.
> >
> > I think one key question here is - do we want to retain a _permanent_ BPF
> > hook here?
> >
> > In any cae, for the first experiments with this we absolutely _must_ be
> > able to express that this is going away, NO, not based on whether it's
> > widely used, it IS going away.
>
> If this BPF kfunc provides clear user value with minimal maintenance
> overhead, what would be the rationale for removing it? That said, if
> you and David both agree it should be deprecated, I won't object -
> though I'd suggest following the standard deprecation process.

You see herein lies the problem... :) from my point of view we want to have
the ability to choose, fundamentally.

We may find out the proposed interface is unworkable, or sets assumptions
we don't want to make.

So I think hiding ehhind a CONFIG_ flag is the best idea here to really
enforce that and make it clear.

Personally I have a sense that we _will_ introduce something permanent. We
just need to have the 'space' to positively decide to do that once the
experimentation is complete.

> > I find this documentation super contradictory. I'm sorry but you can't
> > have:
> >
> > "...can therefore be modified or removed by a maintainer of the subsystem
> >  they’re defined in when it’s deemed necessary."
> >
> > And:
> >
> > "kfuncs that are widely used or have been in the kernel for a long time
> > will be more difficult to justify being changed or removed by a
> > maintainer."
> >
> > At the same time. Let alone:
> >
> > "A kfunc will never have any hard stability guarantees. BPF APIs cannot and
> > will not ever hard-block a change in the kernel purely for stability
> > reasons"
> >
> > Make your mind up!!
> >
> > I mean the EXPORT_SYMBOL_GPL() example isn't accurate AT ALL - we can
> > _absolutely_ change or remove those _at will_ as we don't care about
> > external modules.
> >
> > Really this seems to be saying, in not so many words, that this is
> > basically a kAPI and you can't change it.
> >
> > So this strictly violates what we need here.
>
> The maintainers have the authority to make the final determination ;-)

Well the kernel doesn't entirely work this way... pressure can come which
impacts what others may do.

If you have people saying 'hey we rely on this and removing it will break
our cloud deployment' and 'hey I checked the docs and it says you guys have
to take this into account', I am not so sure Andrew or Linus will accept
the patch.

> > I wonder if we can use a CONFIG_xxx and put this behind that, which
> > specifically says 'WE WILL REMOVE THIS'
> > CONFIG_EXPERIMENTAL_DO_NOT_USE_THP_THINGY :P
>
> That's a reasonable suggestion. We could implement this function under
> CONFIG_EXPERIMENTAL to mark it as experimental infrastructure.

Thanks! Yes, I was looking for this flag :P didn't know if we still had
that or not actually...

But, yeah, putting it behind that explicitly also makes it very clearly.

CONFIG_EXPERIMENTAL_BPF_FAULT_ORDER relies on CONFIG_EXPERIMENTAL makes it
you know... pretty clear ;)

>
> --
> Regards
> Yafang

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22 12:02                   ` Yafang Shao
@ 2025-07-22 12:08                     ` Lorenzo Stoakes
  0 siblings, 0 replies; 33+ messages in thread
From: Lorenzo Stoakes @ 2025-07-22 12:08 UTC (permalink / raw)
  To: Yafang Shao
  Cc: David Hildenbrand, Alexei Starovoitov, Matthew Wilcox, akpm, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Tue, Jul 22, 2025 at 08:02:15PM +0800, Yafang Shao wrote:
> On Tue, Jul 22, 2025 at 7:55 PM Lorenzo Stoakes
> > >
> > > While this is not a strict requirement—maintainers can remove kfuncs
> > > immediately without deprecation—following this guideline helps avoid
> > > unnecessary disruptions for users.
> >
> > The documentation doesn't state this, you are surely just inferring it?
>
> As noted in the other thread, the maintainers ultimately have final
> say on this matter. I'm simply sharing my perspective here.

Sure, sorry, I was probably being a bit too succinct in my reply here :P
see other thread, I think the situation won't necessarily be quite as clear
in practice.

>
> >
> > >
> > > --
> > > Regards
> > > Yafang
> >
> > Overall I think we need a different mechanism in addition to this, such as
> > a very clearly described CONFIG_ option that makes it ABUNDANTLY clear that
> > the config and thus the related BPF hook may be removed at any time.
> >
> > Ideally with 'experimental' or such in the name, or perhaps even tainting
> > the kernel.
>
> Agreed.

Thanks, discussed in other thread also.

>
> >
> > We definitely need something better than what this documentation is saying,
> > sorry. I am not in any way confident in relying no what this document
> > states.
>
> The documentation has always been difficult to understand ;-)

:) I think it is being rather too cautious or really circumspect in its
wording, and could do with being a lot more direct, in my view.

But I suppose since organising kernel developers into agreeing on something
is like herding cats, it might be difficult to get agreement to the point
where the documentation could really be that.

>
> --
> Regards
> Yafang

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment
  2025-07-22 12:04                   ` Lorenzo Stoakes
@ 2025-07-22 12:16                     ` Yafang Shao
  0 siblings, 0 replies; 33+ messages in thread
From: Yafang Shao @ 2025-07-22 12:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Alexei Starovoitov, Matthew Wilcox, akpm, ziy,
	baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, ast, daniel, andrii, bpf, linux-mm

On Tue, Jul 22, 2025 at 8:05 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Jul 22, 2025 at 07:56:21PM +0800, Yafang Shao wrote:
> > On Tue, Jul 22, 2025 at 6:09 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Tue, Jul 22, 2025 at 09:28:02AM +0200, David Hildenbrand wrote:
> > > > On 22.07.25 04:40, Yafang Shao wrote:
> > > > > On Sun, Jul 20, 2025 at 11:56 PM David Hildenbrand <david@redhat.com> wrote:
> > > > > >
> > > > > > > >
> > > > > > > > We discussed this yesterday at a THP upstream meeting, and what we
> > > > > > > > should look into is:
> > > > > > > >
> > > > > > > > (1) Having a callback like
> > > > > > > >
> > > > > > > > unsigned int (*get_suggested_order)(.., bool in_pagefault);
> > > > > > >
> > > > > > > This interface meets our needs precisely, enabling allocation orders
> > > > > > > of either 0 or 9 as required by our workloads.
> > >
> > > That's great to hear, and to be clear my views align with David on this - I
> > > feel like having a _carefully chosen_ BPF interface could be valuable here,
> > > especially in the short to medium term where it will allow us to more
> > > rapidly iterate on an automated [m]THP mechanism.
> > >
> > > I think one key question here is - do we want to retain a _permanent_ BPF
> > > hook here?
> > >
> > > In any cae, for the first experiments with this we absolutely _must_ be
> > > able to express that this is going away, NO, not based on whether it's
> > > widely used, it IS going away.
> >
> > If this BPF kfunc provides clear user value with minimal maintenance
> > overhead, what would be the rationale for removing it? That said, if
> > you and David both agree it should be deprecated, I won't object -
> > though I'd suggest following the standard deprecation process.
>
> You see herein lies the problem... :) from my point of view we want to have
> the ability to choose, fundamentally.
>
> We may find out the proposed interface is unworkable, or sets assumptions
> we don't want to make.
>
> So I think hiding ehhind a CONFIG_ flag is the best idea here to really
> enforce that and make it clear.
>
> Personally I have a sense that we _will_ introduce something permanent. We
> just need to have the 'space' to positively decide to do that once the
> experimentation is complete.

Thanks for your explanation.

>
> > > I find this documentation super contradictory. I'm sorry but you can't
> > > have:
> > >
> > > "...can therefore be modified or removed by a maintainer of the subsystem
> > >  they’re defined in when it’s deemed necessary."
> > >
> > > And:
> > >
> > > "kfuncs that are widely used or have been in the kernel for a long time
> > > will be more difficult to justify being changed or removed by a
> > > maintainer."
> > >
> > > At the same time. Let alone:
> > >
> > > "A kfunc will never have any hard stability guarantees. BPF APIs cannot and
> > > will not ever hard-block a change in the kernel purely for stability
> > > reasons"
> > >
> > > Make your mind up!!
> > >
> > > I mean the EXPORT_SYMBOL_GPL() example isn't accurate AT ALL - we can
> > > _absolutely_ change or remove those _at will_ as we don't care about
> > > external modules.
> > >
> > > Really this seems to be saying, in not so many words, that this is
> > > basically a kAPI and you can't change it.
> > >
> > > So this strictly violates what we need here.
> >
> > The maintainers have the authority to make the final determination ;-)
>
> Well the kernel doesn't entirely work this way... pressure can come which
> impacts what others may do.
>
> If you have people saying 'hey we rely on this and removing it will break
> our cloud deployment' and 'hey I checked the docs and it says you guys have
> to take this into account', I am not so sure Andrew or Linus will accept
> the patch.

understood.

>
> > > I wonder if we can use a CONFIG_xxx and put this behind that, which
> > > specifically says 'WE WILL REMOVE THIS'
> > > CONFIG_EXPERIMENTAL_DO_NOT_USE_THP_THINGY :P
> >
> > That's a reasonable suggestion. We could implement this function under
> > CONFIG_EXPERIMENTAL to mark it as experimental infrastructure.
>
> Thanks! Yes, I was looking for this flag :P didn't know if we still had
> that or not actually...
>
> But, yeah, putting it behind that explicitly also makes it very clearly.
>
> CONFIG_EXPERIMENTAL_BPF_FAULT_ORDER relies on CONFIG_EXPERIMENTAL makes it
> you know... pretty clear ;)

Agreed. Let's move forward with the CONFIG_EXPERIMENTAL_BPF_FAULT_ORDER option.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2025-07-22 12:16 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-08  7:35 [RFC PATCH v3 0/5] mm, bpf: BPF based THP adjustment Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 1/5] mm, thp: use __thp_vma_allowable_orders() in khugepaged_enter_vma() Yafang Shao
2025-07-17 14:48   ` Usama Arif
2025-07-20  2:37     ` Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 2/5] mm, thp: add bpf thp hook to determine thp allocator Yafang Shao
2025-07-17 15:30   ` Usama Arif
2025-07-20  3:00     ` Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 3/5] mm, thp: add bpf thp hook to determine thp reclaimer Yafang Shao
2025-07-17 16:06   ` Usama Arif
2025-07-20  3:03     ` Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 4/5] mm: thp: add bpf thp struct ops Yafang Shao
2025-07-17 16:25   ` Usama Arif
2025-07-17 18:21   ` Amery Hung
2025-07-20  3:07     ` Yafang Shao
2025-06-08  7:35 ` [RFC PATCH v3 5/5] selftests/bpf: Add selftest for THP adjustment Yafang Shao
2025-07-15 22:42 ` [RFC PATCH v3 0/5] mm, bpf: BPF based " David Hildenbrand
2025-07-17  3:09   ` Yafang Shao
2025-07-17  8:52     ` David Hildenbrand
2025-07-17  9:05       ` Lorenzo Stoakes
2025-07-20  2:32       ` Yafang Shao
2025-07-20 15:56         ` David Hildenbrand
2025-07-22  2:40           ` Yafang Shao
2025-07-22  7:28             ` David Hildenbrand
2025-07-22 10:09               ` Lorenzo Stoakes
2025-07-22 11:56                 ` Yafang Shao
2025-07-22 12:04                   ` Lorenzo Stoakes
2025-07-22 12:16                     ` Yafang Shao
2025-07-22 11:46               ` Yafang Shao
2025-07-22 11:54                 ` Lorenzo Stoakes
2025-07-22 12:02                   ` Yafang Shao
2025-07-22 12:08                     ` Lorenzo Stoakes
2025-07-17 16:35 ` Usama Arif
2025-07-20  2:54   ` Yafang Shao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).