linux-mm.kvack.org archive mirror
* [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection
@ 2025-09-10  2:44 Yafang Shao
  2025-09-10  2:44 ` [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot Yafang Shao
                   ` (10 more replies)
  0 siblings, 11 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Background
==========

Our production servers consistently configure THP to "never" due to
historical incidents caused by its behavior. Key issues include:
- Increased Memory Consumption
  THP significantly raises overall memory usage, reducing available memory
  for workloads.

- Latency Spikes
  Random latency spikes occur due to frequent memory compaction triggered
  by THP.

- Lack of Fine-Grained Control
  THP tuning is globally configured, making it unsuitable for containerized
  environments. When multiple workloads share a host, enabling THP without
  per-workload control leads to unpredictable behavior.

Because of these issues, administrators are unwilling to switch to the
madvise or always modes unless per-workload THP control is available.

To address this, we propose a BPF-based THP policy for flexible, per-workload
adjustment. Additionally, as David mentioned, this mechanism can also serve
as a policy prototyping tool: policies can be tested via BPF before being
upstreamed.

Proposed Solution
=================

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook thp_get_order(), allowing BPF programs to
influence THP order selection based on factors such as:

- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, swap or
  other paths.
- VMA's memory advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
  PSI system data or associated cgroup PSI metrics

The new interface for the BPF program is as follows:

/**
 * @thp_get_order: Get the suggested THP orders from a BPF program for allocation
 * @vma: vm_area_struct associated with the THP allocation
 * @vma_type: The VMA type: BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set,
 *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE
 *            if neither is set.
 * @tva_type: TVA type for current @vma
 * @orders: Bitmask of requested THP orders for this allocation
 *          - PMD-mapped allocation if PMD_ORDER is set
 *          - mTHP allocation otherwise
 *
 * Return: The suggested THP order from the BPF program for allocation. It will
 *         not exceed the highest requested order in @orders. Return -1 to
 *         indicate that the original requested @orders should remain unchanged.
 */

int thp_get_order(struct vm_area_struct *vma,
                  enum bpf_thp_vma_type vma_type,
                  enum tva_type tva_type,
                  unsigned long orders);

Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.
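The clamping contract described above (a return of -1 keeps @orders, and a
suggested order above the highest requested order is ignored) mirrors the
kernel-side bpf_hook_thp_get_orders() in patch #2. A purely illustrative
userspace shell sketch of that logic:

```shell
# Index of the highest set bit in an orders bitmask (like highest_order()).
highest_order() {
	local o=$1 n=-1
	while [ "$o" -gt 0 ]; do o=$((o >> 1)); n=$((n + 1)); done
	echo "$n"
}

# Apply a BPF-suggested order to a requested orders bitmask:
# -1 keeps @orders unchanged; a suggestion above the highest requested
# order is also ignored; otherwise the result is the single suggested order.
apply_suggestion() {
	local orders=$1 thp_order=$2
	if [ "$thp_order" -lt 0 ] || \
	   [ "$thp_order" -gt "$(highest_order "$orders")" ]; then
		echo "$orders"
	else
		echo $((1 << thp_order))
	fi
}

apply_suggestion $((1 << 9)) -1   # prints 512: PMD-order request unchanged
apply_suggestion $((1 << 9)) 0    # prints 1:   order-0 suggested and accepted
apply_suggestion 7 9              # prints 7:   suggestion too high, ignored
```

Here order 9 stands in for PMD_ORDER (the typical x86-64 value); the helper
names are illustrative, not part of the kernel API.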

This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.

**WARNING**
- This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to
  be enabled.
- The interface may change.
- Behavior may differ in future kernel versions.
- We might remove it in the future.

Selftests
=========

BPF CI 
------

Patch #7: Implements a basic BPF THP policy that restricts THP allocation
          via khugepaged to tasks within a specified memory cgroup.
Patch #8: Provides tests for dynamic BPF program updates and replacement.
Patch #9: Includes negative tests for invalid BPF helper usage, verifying
          proper verification by the BPF verifier.

Currently, several dependency patches reside in mm-new but haven't been
merged into bpf-next. To enable BPF CI testing, these dependencies were
manually applied to bpf-next. All selftests in this series pass 
successfully [0].

Performance Evaluation
----------------------

Because this series modifies the page fault handler, its impact on page
fault performance was measured with the standard `perf bench mem memset`
benchmark.

Testing was conducted on an AMD EPYC 7W83 64-core processor (single NUMA
node). Due to variance between individual runs, a script executed 10,000
iterations to produce meaningful averages. Three configurations were
compared:

- Baseline (without this patch series)
- With patch series but no BPF program attached
- With patch series and BPF program attached

The results across three configurations show negligible performance impact:

  Number of runs: 10,000
  Average throughput: 40-41 GB/sec
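The per-run throughput figures can be aggregated with a small awk filter; a
sketch on stand-in sample lines (a real run would pipe the repeated
`perf bench mem memset` output instead of the printf, and the exact output
format may differ by perf version):

```shell
# Aggregate "GB/sec" throughput lines from repeated benchmark runs.
# The three sample lines below stand in for captured perf output.
printf '%s\n' \
	'      40.81 GB/sec' \
	'      41.02 GB/sec' \
	'      40.55 GB/sec' |
awk '{ sum += $1; n++ }
     END { printf "avg %.2f GB/sec over %d runs\n", sum/n, n }'
# prints: avg 40.79 GB/sec over 3 runs
```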

Production verification
-----------------------

We have successfully deployed a variant of this approach across numerous
Kubernetes production servers. The implementation enables THP for specific
workloads (such as applications utilizing ZGC [1]) while disabling it for
others. This selective deployment has operated flawlessly, with no
regression reports to date.

For ZGC-based applications, our verification demonstrates that shmem THP
delivers significant improvements:
- Reduced CPU utilization
- Lower average latencies

We are continuing to extend this support to more workloads, such as
TCMalloc-based services [2].

Deployment steps on our production servers are as follows:

1. Initial setup:
- Set THP mode to "never" (disabling THP by default).
- Attach the BPF program and pin the BPF maps and links.
- Pinning ensures persistence (like a kernel module), preventing
  disruption under system pressure.
- A THP whitelist map tracks allowed cgroups (initially empty -> no THP
  allocations).

2. Enable THP control:
- Switch THP mode to "always" or "madvise" (BPF now governs actual
  allocations).

3. Dynamic management:
- To permit THP for a cgroup, add its ID to the whitelist map.
- To revoke permission, remove the cgroup ID from the map.
- The BPF program can be updated live (policy adjustments require no
  task interruption).

4. To roll back, disable THP and remove this BPF program.
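The steps above could be sketched as follows. This is a hypothetical
deployment fragment, not from the series: the object file, pin directory,
map name, and cgroup/key encoding are all illustrative, and the commands
require root plus a kernel with this feature enabled.

```shell
# 1. Start from "never" so nothing changes until the policy is in place.
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# 2. Load the struct_ops program and pin its link (names are illustrative).
bpftool struct_ops register thp_policy.bpf.o /sys/fs/bpf/thp/

# 3. Hand control to the BPF policy.
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# 4. Permit THP for a cgroup: add its ID (u64 key, little-endian hex bytes)
#    to the pinned whitelist map.
bpftool map update pinned /sys/fs/bpf/thp/thp_whitelist \
	key hex 01 00 00 00 00 00 00 00 value hex 01 00 00 00 00 00 00 00

# Rollback: restore "never" and drop the pinned policy.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
rm -rf /sys/fs/bpf/thp
```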

**WARNING**
Be aware that the maintainers do not endorse this use case, as the BPF hook
interface is unstable and might be removed from the upstream kernel, unless
you have your own kernel team to maintain it ;-)

Future work
===========

file-backed THP policy
----------------------

Based on our validation with production workloads, we observed mixed
results with XFS large folios (also known as file-backed THP):

- Performance Benefits
  Some workloads demonstrated significant improvements with XFS large
  folios enabled
- Performance Regression
  Some workloads experienced degradation when using XFS large folios

These results demonstrate that file-backed THP, like anonymous THP, requires
a more granular approach rather than a uniform policy.

We will extend the BPF-based order selection mechanism to support
file-backed THP allocation policies.

Hooking fork() with BPF for Task Configuration
----------------------------------------------

The current method for controlling a newly fork()-ed task involves calling
prctl() (e.g., with PR_SET_THP_DISABLE) to set flags in its mm->flags. This
requires explicit userspace modification.

A more efficient alternative is to implement a new BPF hook within the
fork() path. This hook would allow a BPF program to set the task's
mm->flags directly after mm initialization, leveraging BPF helpers for a
solution that is transparent to userspace. This is particularly valuable in
data center environments for fleet-wide management. 

Link: https://github.com/kernel-patches/bpf/pull/9706 [0] 
Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTr... [1]
Link: https://google.github.io/tcmalloc/tuning.html#system-level-optimizations [2]

Changes
=======

v6->v7:
Key Changes Implemented Based on Feedback:
From Lorenzo:
  - Rename the hook from get_suggested_order() to bpf_hook_thp_get_order()
  - Rename bpf_thp.c to huge_memory_bpf.c
  - Focus the current patchset on THP order selection
  - Add the BPF hook into thp_vma_allowable_orders()
  - Make the hook VMA-based and remove the mm parameter
  - Modify the BPF program to return a single order
  - Stop passing vma_flags directly to BPF programs
  - Mark vma->vm_mm as trusted_or_null
  - Update the MAINTAINERS file
From Andrii:
  - Mark mm->owner as rcu_or_null to avoid introducing new helpers
From Barry:
  - Decouple swap from the normal page fault path
From the kernel test robot:
  - Fix a sparse warning
Shakeel helped clarify the implementation.

RFC v5-> v6: https://lwn.net/Articles/1035116/
- Code improvement around the RCU usage (Usama)
- Add selftests for khugepaged fork (Usama)
- Add performance data for page fault (Usama)
- Remove the RFC tag

RFC v4->v5: https://lwn.net/Articles/1034265/
- Add support for vma (David)
- Add mTHP support in khugepaged (Zi)
- Use bitmask of all allowed orders instead (Zi)
- Retrieve the page size and PMD order rather than hardcoding them (Zi)

RFC v3->v4: https://lwn.net/Articles/1031829/
- Use a new interface get_suggested_order() (David)
- Mark it as experimental (David, Lorenzo)
- Code improvement in THP (Usama)
- Code improvement in BPF struct ops (Amery)

RFC v2->v3: https://lwn.net/Articles/1024545/
- Finer-grained tuning based on madvise or always mode (David, Lorenzo)
- Use BPF to write more advanced policies logic (David, Lorenzo)

RFC v1->v2: https://lwn.net/Articles/1021783/
The main changes are as follows:
- Use struct_ops instead of fmod_ret (Alexei)
- Introduce a new THP mode (Johannes)
- Introduce new helpers for BPF hook (Zi)
- Refine the commit log

RFC v1: https://lwn.net/Articles/1019290/

Yafang Shao (10):
  mm: thp: remove disabled task from khugepaged_mm_slot
  mm: thp: add support for BPF based THP order selection
  mm: thp: decouple THP allocation between swap and page fault paths
  mm: thp: enable THP allocation exclusively through khugepaged
  bpf: mark mm->owner as __safe_rcu_or_null
  bpf: mark vma->vm_mm as __safe_trusted_or_null
  selftests/bpf: add a simple BPF based THP policy
  selftests/bpf: add test case to update THP policy
  selftests/bpf: add test cases for invalid thp_adjust usage
  Documentation: add BPF-based THP policy management

 Documentation/admin-guide/mm/transhuge.rst    |  46 +++
 MAINTAINERS                                   |   3 +
 include/linux/huge_mm.h                       |  29 +-
 include/linux/khugepaged.h                    |   1 +
 kernel/bpf/verifier.c                         |   8 +
 kernel/sys.c                                  |   6 +
 mm/Kconfig                                    |  12 +
 mm/Makefile                                   |   1 +
 mm/huge_memory.c                              |   3 +-
 mm/huge_memory_bpf.c                          | 243 +++++++++++++++
 mm/khugepaged.c                               |  19 +-
 mm/memory.c                                   |  15 +-
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 284 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/lsm.c       |   8 +-
 .../selftests/bpf/progs/test_thp_adjust.c     | 114 +++++++
 .../bpf/progs/test_thp_adjust_sleepable.c     |  22 ++
 .../bpf/progs/test_thp_adjust_trusted_owner.c |  30 ++
 .../bpf/progs/test_thp_adjust_trusted_vma.c   |  27 ++
 19 files changed, 849 insertions(+), 25 deletions(-)
 create mode 100644 mm/huge_memory_bpf.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c

-- 
2.47.3



^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-10  5:11   ` Lance Yang
                     ` (3 more replies)
  2025-09-10  2:44 ` [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection Yafang Shao
                   ` (9 subsequent siblings)
  10 siblings, 4 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao, Lance Yang

Since a task with MMF_DISABLE_THP_COMPLETELY cannot use THP, remove it from
the khugepaged_mm_slot to stop khugepaged from processing it.

After this change, the following semantic relationship always holds:

  MMF_VM_HUGEPAGE is set     == task is in khugepaged mm_slot
  MMF_VM_HUGEPAGE is not set == task is not in khugepaged mm_slot

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Lance Yang <ioworker0@gmail.com>
---
 include/linux/khugepaged.h |  1 +
 kernel/sys.c               |  6 ++++++
 mm/khugepaged.c            | 19 +++++++++----------
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eb1946a70cff..6cb9107f1006 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -19,6 +19,7 @@ extern void khugepaged_min_free_kbytes_update(void);
 extern bool current_is_khugepaged(void);
 extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				   bool install_pmd);
+bool hugepage_pmd_enabled(void);
 
 static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
diff --git a/kernel/sys.c b/kernel/sys.c
index a46d9b75880b..a1c1e8007f2d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -8,6 +8,7 @@
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/khugepaged.h>
 #include <linux/utsname.h>
 #include <linux/mman.h>
 #include <linux/reboot.h>
@@ -2493,6 +2494,11 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
 		mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
 		mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
 	}
+
+	if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
+	    !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
+	    hugepage_pmd_enabled())
+		__khugepaged_enter(mm);
 	mmap_write_unlock(current->mm);
 	return 0;
 }
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4ec324a4c1fe..88ac482fb3a0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -413,7 +413,7 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
 		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
 }
 
-static bool hugepage_pmd_enabled(void)
+bool hugepage_pmd_enabled(void)
 {
 	/*
 	 * We cover the anon, shmem and the file-backed case here; file-backed
@@ -445,6 +445,7 @@ void __khugepaged_enter(struct mm_struct *mm)
 
 	/* __khugepaged_exit() must not run from under us */
 	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
+	WARN_ON_ONCE(mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm));
 	if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
 		return;
 
@@ -472,7 +473,8 @@ void __khugepaged_enter(struct mm_struct *mm)
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
-	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
+	if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, vma->vm_mm) &&
+	    !mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
 	    hugepage_pmd_enabled()) {
 		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
 			__khugepaged_enter(vma->vm_mm);
@@ -1451,16 +1453,13 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
 
 	lockdep_assert_held(&khugepaged_mm_lock);
 
-	if (hpage_collapse_test_exit(mm)) {
+	if (hpage_collapse_test_exit_or_disable(mm)) {
 		/* free mm_slot */
 		hash_del(&slot->hash);
 		list_del(&slot->mm_node);
 
-		/*
-		 * Not strictly needed because the mm exited already.
-		 *
-		 * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
-		 */
+		/* If the mm is disabled, this flag must be cleared. */
+		mm_flags_clear(MMF_VM_HUGEPAGE, mm);
 
 		/* khugepaged_mm_lock actually not necessary for the below */
 		mm_slot_free(mm_slot_cache, mm_slot);
@@ -2507,9 +2506,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
 	VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
 	/*
 	 * Release the current mm_slot if this mm is about to die, or
-	 * if we scanned all vmas of this mm.
+	 * if we scanned all vmas of this mm, or if this mm is disabled.
 	 */
-	if (hpage_collapse_test_exit(mm) || !vma) {
+	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
 		/*
 		 * Make sure that if mm_users is reaching zero while
 		 * khugepaged runs here, khugepaged_exit will find
-- 
2.47.3




* [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
  2025-09-10  2:44 ` [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-10 12:42   ` Lance Yang
                     ` (3 more replies)
  2025-09-10  2:44 ` [PATCH v7 mm-new 03/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
                   ` (8 subsequent siblings)
  10 siblings, 4 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, swap or
  other paths.
- VMA's memory advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
  PSI system data or associated cgroup PSI metrics

The kernel API of this new BPF hook is as follows,

/**
 * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
 * @vma: vm_area_struct associated with the THP allocation
 * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
 *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
 *            neither is set.
 * @tva_type: TVA type for current @vma
 * @orders: Bitmask of requested THP orders for this allocation
 *          - PMD-mapped allocation if PMD_ORDER is set
 *          - mTHP allocation otherwise
 *
 * Return: The suggested THP order from the BPF program for allocation. It will
 *         not exceed the highest requested order in @orders. Return -1 to
 *         indicate that the original requested @orders should remain unchanged.
 */
typedef int thp_order_fn_t(struct vm_area_struct *vma,
			   enum bpf_thp_vma_type vma_type,
			   enum tva_type tva_type,
			   unsigned long orders);

Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.

This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.

This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
enabled. Note that this capability is currently unstable and may undergo
significant changes, including potential removal, in future kernel versions.

Suggested-by: David Hildenbrand <david@redhat.com>
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 MAINTAINERS             |   1 +
 include/linux/huge_mm.h |  26 ++++-
 mm/Kconfig              |  12 ++
 mm/Makefile             |   1 +
 mm/huge_memory_bpf.c    | 243 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 280 insertions(+), 3 deletions(-)
 create mode 100644 mm/huge_memory_bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 8fef05bc2224..d055a3c95300 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16252,6 +16252,7 @@ F:	include/linux/huge_mm.h
 F:	include/linux/khugepaged.h
 F:	include/trace/events/huge_memory.h
 F:	mm/huge_memory.c
+F:	mm/huge_memory_bpf.c
 F:	mm/khugepaged.c
 F:	mm/mm_slot.h
 F:	tools/testing/selftests/mm/khugepaged.c
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 23f124493c47..f72a5fd04e4f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
 	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
 	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
 	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
 };
 
 struct kobject;
@@ -270,6 +271,19 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 enum tva_type type,
 					 unsigned long orders);
 
+#ifdef CONFIG_BPF_GET_THP_ORDER
+unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
+			enum tva_type type, unsigned long orders);
+#else
+static inline unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
+			enum tva_type tva_flags, unsigned long orders)
+{
+	return orders;
+}
+#endif
+
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check
@@ -291,6 +305,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       enum tva_type type,
 				       unsigned long orders)
 {
+	unsigned long bpf_orders;
+
+	bpf_orders = bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
+	if (!bpf_orders)
+		return 0;
+
 	/*
 	 * Optimization to check if required orders are enabled early. Only
 	 * forced collapse ignores sysfs configs.
@@ -304,12 +324,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
 			mask |= READ_ONCE(huge_anon_orders_inherit);
 
-		orders &= mask;
-		if (!orders)
+		bpf_orders &= mask;
+		if (!bpf_orders)
 			return 0;
 	}
 
-	return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
+	return __thp_vma_allowable_orders(vma, vm_flags, type, bpf_orders);
 }
 
 struct thpsize {
diff --git a/mm/Kconfig b/mm/Kconfig
index d1ed839ca710..4d89d2158f10 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -896,6 +896,18 @@ config NO_PAGE_MAPCOUNT
 
 	  EXPERIMENTAL because the impact of some changes is still unclear.
 
+config BPF_GET_THP_ORDER
+	bool "BPF-based THP order selection (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
+
+	help
+	  Enable dynamic THP order selection using BPF programs. This
+	  experimental feature allows custom BPF logic to determine optimal
+	  transparent hugepage allocation sizes at runtime.
+
+	  WARNING: This feature is unstable and may change in future kernel
+	  versions.
+
 endif # TRANSPARENT_HUGEPAGE
 
 # simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..f180332f2ad0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_BPF_GET_THP_ORDER) += huge_memory_bpf.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
new file mode 100644
index 000000000000..525ee22ab598
--- /dev/null
+++ b/mm/huge_memory_bpf.c
@@ -0,0 +1,243 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF-based THP policy management
+ *
+ * Author: Yafang Shao <laoar.shao@gmail.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+enum bpf_thp_vma_type {
+	BPF_THP_VM_NONE = 0,
+	BPF_THP_VM_HUGEPAGE,	/* VM_HUGEPAGE */
+	BPF_THP_VM_NOHUGEPAGE,	/* VM_NOHUGEPAGE */
+};
+
+/**
+ * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
+ * @vma: vm_area_struct associated with the THP allocation
+ * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
+ *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
+ *            neither is set.
+ * @tva_type: TVA type for current @vma
+ * @orders: Bitmask of requested THP orders for this allocation
+ *          - PMD-mapped allocation if PMD_ORDER is set
+ *          - mTHP allocation otherwise
+ *
+ * Return: The suggested THP order from the BPF program for allocation. It will
+ *         not exceed the highest requested order in @orders. Return -1 to
+ *         indicate that the original requested @orders should remain unchanged.
+ */
+typedef int thp_order_fn_t(struct vm_area_struct *vma,
+			   enum bpf_thp_vma_type vma_type,
+			   enum tva_type tva_type,
+			   unsigned long orders);
+
+struct bpf_thp_ops {
+	thp_order_fn_t __rcu *thp_get_order;
+};
+
+static struct bpf_thp_ops bpf_thp;
+static DEFINE_SPINLOCK(thp_ops_lock);
+
+/*
+ * Returns the original @orders if no BPF program is attached or if the
+ * suggested order is invalid.
+ */
+unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
+				      vm_flags_t vma_flags,
+				      enum tva_type tva_type,
+				      unsigned long orders)
+{
+	thp_order_fn_t *bpf_hook_thp_get_order;
+	unsigned long thp_orders = orders;
+	enum bpf_thp_vma_type vma_type;
+	int thp_order;
+
+	/* No BPF program is attached */
+	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+		      &transparent_hugepage_flags))
+		return orders;
+
+	if (vma_flags & VM_HUGEPAGE)
+		vma_type = BPF_THP_VM_HUGEPAGE;
+	else if (vma_flags & VM_NOHUGEPAGE)
+		vma_type = BPF_THP_VM_NOHUGEPAGE;
+	else
+		vma_type = BPF_THP_VM_NONE;
+
+	rcu_read_lock();
+	bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
+	if (!bpf_hook_thp_get_order)
+		goto out;
+
+	thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
+	if (thp_order < 0)
+		goto out;
+	/*
+	 * The maximum requested order is determined by the callsite. E.g.:
+	 * - PMD-mapped THP uses PMD_ORDER
+	 * - mTHP uses (PMD_ORDER - 1)
+	 *
+	 * We must respect this upper bound to avoid undefined behavior. So the
+	 * highest suggested order can't exceed the highest requested order.
+	 */
+	if (thp_order <= highest_order(orders))
+		thp_orders = BIT(thp_order);
+
+out:
+	rcu_read_unlock();
+	return thp_orders;
+}
+
+static bool bpf_thp_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_func_proto *
+bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id, prog);
+}
+
+static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
+	.get_func_proto = bpf_thp_get_func_proto,
+	.is_valid_access = bpf_thp_ops_is_valid_access,
+};
+
+static int bpf_thp_init(struct btf *btf)
+{
+	return 0;
+}
+
+static int bpf_thp_check_member(const struct btf_type *t,
+				const struct btf_member *member,
+				const struct bpf_prog *prog)
+{
+	/* The call site operates under RCU protection. */
+	if (prog->sleepable)
+		return -EINVAL;
+	return 0;
+}
+
+static int bpf_thp_init_member(const struct btf_type *t,
+			       const struct btf_member *member,
+			       void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_thp_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	spin_lock(&thp_ops_lock);
+	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+			     &transparent_hugepage_flags)) {
+		spin_unlock(&thp_ops_lock);
+		return -EBUSY;
+	}
+	WARN_ON_ONCE(rcu_access_pointer(bpf_thp.thp_get_order));
+	rcu_assign_pointer(bpf_thp.thp_get_order, ops->thp_get_order);
+	spin_unlock(&thp_ops_lock);
+	return 0;
+}
+
+static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
+{
+	thp_order_fn_t *old_fn;
+
+	spin_lock(&thp_ops_lock);
+	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
+	old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, NULL,
+				     lockdep_is_held(&thp_ops_lock));
+	WARN_ON_ONCE(!old_fn);
+	spin_unlock(&thp_ops_lock);
+
+	synchronize_rcu();
+}
+
+static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
+{
+	thp_order_fn_t *old_fn, *new_fn;
+	struct bpf_thp_ops *old = old_kdata;
+	struct bpf_thp_ops *ops = kdata;
+	int ret = 0;
+
+	if (!ops || !old)
+		return -EINVAL;
+
+	spin_lock(&thp_ops_lock);
+	/* The prog has already been removed. */
+	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
+		      &transparent_hugepage_flags)) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	new_fn = rcu_dereference(ops->thp_get_order);
+	old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, new_fn,
+				     lockdep_is_held(&thp_ops_lock));
+	WARN_ON_ONCE(!old_fn || !new_fn);
+
+out:
+	spin_unlock(&thp_ops_lock);
+	if (!ret)
+		synchronize_rcu();
+	return ret;
+}
+
+static int bpf_thp_validate(void *kdata)
+{
+	struct bpf_thp_ops *ops = kdata;
+
+	if (!ops->thp_get_order) {
+		pr_err("bpf_thp: required ops isn't implemented\n");
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int bpf_thp_get_order(struct vm_area_struct *vma,
+			     enum bpf_thp_vma_type vma_type,
+			     enum tva_type tva_type,
+			     unsigned long orders)
+{
+	return -1;
+}
+
+static struct bpf_thp_ops __bpf_thp_ops = {
+	.thp_get_order = (thp_order_fn_t __rcu *)bpf_thp_get_order,
+};
+
+static struct bpf_struct_ops bpf_bpf_thp_ops = {
+	.verifier_ops = &thp_bpf_verifier_ops,
+	.init = bpf_thp_init,
+	.check_member = bpf_thp_check_member,
+	.init_member = bpf_thp_init_member,
+	.reg = bpf_thp_reg,
+	.unreg = bpf_thp_unreg,
+	.update = bpf_thp_update,
+	.validate = bpf_thp_validate,
+	.cfi_stubs = &__bpf_thp_ops,
+	.owner = THIS_MODULE,
+	.name = "bpf_thp_ops",
+};
+
+static int __init bpf_thp_ops_init(void)
+{
+	int err;
+
+	err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
+	if (err)
+		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
+	return err;
+}
+late_initcall(bpf_thp_ops_init);
-- 
2.47.3




* [PATCH v7 mm-new 03/10] mm: thp: decouple THP allocation between swap and page fault paths
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
  2025-09-10  2:44 ` [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot Yafang Shao
  2025-09-10  2:44 ` [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-11 14:55   ` Lorenzo Stoakes
  2025-09-10  2:44 ` [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

The new BPF capability enables finer-grained THP policy decisions by
introducing separate handling for swap faults versus normal page faults.

As highlighted by Barry:

  We’ve observed that swapping in large folios can lead to more
  swap thrashing for some workloads, e.g. kernel build. Consequently,
  some workloads might prefer swapping in smaller folios than those
  allocated by alloc_anon_folio().

While prctl() could potentially be extended to leverage this new policy,
doing so would require modifications to the uAPI.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Barry Song <21cnbao@gmail.com>
---
 include/linux/huge_mm.h | 3 ++-
 mm/huge_memory.c        | 2 +-
 mm/memory.c             | 2 +-
 3 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f72a5fd04e4f..b9742453806f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -97,9 +97,10 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 
 enum tva_type {
 	TVA_SMAPS,		/* Exposing "THPeligible:" in smaps. */
-	TVA_PAGEFAULT,		/* Serving a page fault. */
+	TVA_PAGEFAULT,		/* Serving a non-swap page fault. */
 	TVA_KHUGEPAGED,		/* Khugepaged collapse. */
 	TVA_FORCED_COLLAPSE,	/* Forced collapse (e.g. MADV_COLLAPSE). */
+	TVA_SWAP,		/* Serving a swap fault. */
 };
 
 #define thp_vma_allowable_order(vma, vm_flags, type, order) \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26cedfcd7418..523153d21a41 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -103,7 +103,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long orders)
 {
 	const bool smaps = type == TVA_SMAPS;
-	const bool in_pf = type == TVA_PAGEFAULT;
+	const bool in_pf = (type == TVA_PAGEFAULT || type == TVA_SWAP);
 	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
 	unsigned long supported_orders;
 
diff --git a/mm/memory.c b/mm/memory.c
index d9de6c056179..d8819cac7930 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4515,7 +4515,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
 	 * and suitable for swapping THP.
 	 */
-	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SWAP,
 					  BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (2 preceding siblings ...)
  2025-09-10  2:44 ` [PATCH v7 mm-new 03/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-11 15:53   ` Lance Yang
  2025-09-11 15:58   ` Lorenzo Stoakes
  2025-09-10  2:44 ` [PATCH v7 mm-new 05/10] bpf: mark mm->owner as __safe_rcu_or_null Yafang Shao
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Currently, THP allocation cannot be restricted to khugepaged alone while
being disabled in the page fault path. This limitation exists because
disabling THP allocation during page faults also prevents the execution of
khugepaged_enter_vma() in that path.

With the introduction of BPF, we can now implement THP policies based on
different TVA types. This patch adjusts the logic to support this new
capability.

While we could also extend prctl() to utilize this new policy, such a
change would require a uAPI modification.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 mm/huge_memory.c |  1 -
 mm/memory.c      | 13 ++++++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 523153d21a41..1e9e7b32e2cf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1346,7 +1346,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
-	khugepaged_enter_vma(vma, vma->vm_flags);
 
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm) &&
diff --git a/mm/memory.c b/mm/memory.c
index d8819cac7930..d0609dc1e371 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6289,11 +6289,14 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	if (pud_trans_unstable(vmf.pud))
 		goto retry_pud;
 
-	if (pmd_none(*vmf.pmd) &&
-	    thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
-		ret = create_huge_pmd(&vmf);
-		if (!(ret & VM_FAULT_FALLBACK))
-			return ret;
+	if (pmd_none(*vmf.pmd)) {
+		if (vma_is_anonymous(vma))
+			khugepaged_enter_vma(vma, vm_flags);
+		if (thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
+			ret = create_huge_pmd(&vmf);
+			if (!(ret & VM_FAULT_FALLBACK))
+				return ret;
+		}
 	} else {
 		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
 
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v7 mm-new 05/10] bpf: mark mm->owner as __safe_rcu_or_null
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (3 preceding siblings ...)
  2025-09-10  2:44 ` [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-11 16:04   ` Lorenzo Stoakes
  2025-09-10  2:44 ` [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The
owner can be NULL. With this change, BPF helpers can safely access
mm->owner to retrieve the associated task from the mm. We can then make
policy decisions based on the task's attributes.

The typical use case is as follows:

  bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
      goto out;

  /* Do something based on the task attribute */

out:
  bpf_rcu_read_unlock();

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/verifier.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index c4f69a9e9af6..d400e18ee31e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7123,6 +7123,9 @@ BTF_TYPE_SAFE_RCU(struct cgroup_subsys_state) {
 /* RCU trusted: these fields are trusted in RCU CS and can be NULL */
 BTF_TYPE_SAFE_RCU_OR_NULL(struct mm_struct) {
 	struct file __rcu *exe_file;
+#ifdef CONFIG_MEMCG
+	struct task_struct __rcu *owner;
+#endif
 };
 
 /* skb->sk, req->sk are not RCU protected, but we mark them as such
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (4 preceding siblings ...)
  2025-09-10  2:44 ` [PATCH v7 mm-new 05/10] bpf: mark mm->owner as __safe_rcu_or_null Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-11 17:08   ` Lorenzo Stoakes
  2025-09-11 17:30   ` Liam R. Howlett
  2025-09-10  2:44 ` [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

The vma->vm_mm pointer might be NULL, and it can be accessed outside of
RCU. Thus, we can mark it as trusted_or_null. With this change, BPF
helpers can safely access vma->vm_mm to retrieve the associated mm_struct
from the VMA, and then make policy decisions based on the VMA.

The LSM selftest must be modified because it directly accesses vma->vm_mm
without a NULL pointer check; otherwise it would break due to this
change.

For the VMA-based THP policy, the use case is as follows:

  @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
  if (!@mm)
      return;
  bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
  @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
  if (!@owner)
    goto out;
  @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);

  /* make the decision based on the @cgroup1 attribute */

  bpf_cgroup_release(@cgroup1); // release the associated cgroup
out:
  bpf_rcu_read_unlock();

PSI memory information can be obtained from the associated cgroup to inform
policy decisions. Since upstream PSI support is currently limited to cgroup
v2, the following example demonstrates cgroup v2 implementation:

  @owner = @mm->owner;
  if (@owner) {
      // @ancestor_cgid is user-configured
      @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
      if (bpf_task_under_cgroup(@owner, @ancestor)) {
          @psi_group = @ancestor->psi;

        /* Extract PSI metrics from @psi_group and
         * implement policy logic based on the values
         */

      }
  }

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 kernel/bpf/verifier.c                   | 5 +++++
 tools/testing/selftests/bpf/progs/lsm.c | 8 +++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d400e18ee31e..b708b98f796c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7165,6 +7165,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
 	struct sock *sk;
 };
 
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
+	struct mm_struct *vm_mm;
+};
+
 static bool type_is_rcu(struct bpf_verifier_env *env,
 			struct bpf_reg_state *reg,
 			const char *field_name, u32 btf_id)
@@ -7206,6 +7210,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
 {
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
 					  "__safe_trusted_or_null");
diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
index 0c13b7409947..7de173daf27b 100644
--- a/tools/testing/selftests/bpf/progs/lsm.c
+++ b/tools/testing/selftests/bpf/progs/lsm.c
@@ -89,14 +89,16 @@ SEC("lsm/file_mprotect")
 int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
 	     unsigned long reqprot, unsigned long prot, int ret)
 {
-	if (ret != 0)
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (ret != 0 || !mm)
 		return ret;
 
 	__s32 pid = bpf_get_current_pid_tgid() >> 32;
 	int is_stack = 0;
 
-	is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
-		    vma->vm_end >= vma->vm_mm->start_stack);
+	is_stack = (vma->vm_start <= mm->start_stack &&
+		    vma->vm_end >= mm->start_stack);
 
 	if (is_stack && monitored_pid == pid) {
 		mprotect_count++;
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (5 preceding siblings ...)
  2025-09-10  2:44 ` [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-10 20:44   ` Alexei Starovoitov
  2025-09-10  2:44 ` [PATCH v7 mm-new 08/10] selftests/bpf: add test case to update " Yafang Shao
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

This selftest verifies that PMD-mapped THP allocation is restricted in
the page fault path for tasks within a specific cgroup, while still
permitting THP allocation via khugepaged.

Since THP allocation depends on various factors (e.g., system memory
pressure), using the actual allocated THP size for validation is
unreliable. Instead, we check the return value of bpf_hook_thp_get_orders(),
which indicates whether the system intends to allocate a THP, regardless of
whether the allocation ultimately succeeds.

This test case defines a simple THP policy. The policy permits
PMD-mapped THP allocation through khugepaged for tasks in a designated
cgroup, but prohibits it for all other tasks and contexts, including the
page fault handler.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 MAINTAINERS                                   |   2 +
 tools/testing/selftests/bpf/config            |   3 +
 .../selftests/bpf/prog_tests/thp_adjust.c     | 254 ++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     | 100 +++++++
 4 files changed, 359 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c

diff --git a/MAINTAINERS b/MAINTAINERS
index d055a3c95300..6aa5543963d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16255,6 +16255,8 @@ F:	mm/huge_memory.c
 F:	mm/huge_memory_bpf.c
 F:	mm/khugepaged.c
 F:	mm/mm_slot.h
+F:	tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+F:	tools/testing/selftests/bpf/progs/test_thp_adjust*
 F:	tools/testing/selftests/mm/khugepaged.c
 F:	tools/testing/selftests/mm/split_huge_page_test.c
 F:	tools/testing/selftests/mm/transhuge-stress.c
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 8916ab814a3e..b2c73cfae14e 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -26,6 +26,7 @@ CONFIG_DMABUF_HEAPS=y
 CONFIG_DMABUF_HEAPS_SYSTEM=y
 CONFIG_DUMMY=y
 CONFIG_DYNAMIC_FTRACE=y
+CONFIG_BPF_GET_THP_ORDER=y
 CONFIG_FPROBE=y
 CONFIG_FTRACE_SYSCALLS=y
 CONFIG_FUNCTION_ERROR_INJECTION=y
@@ -51,6 +52,7 @@ CONFIG_IPV6_TUNNEL=y
 CONFIG_KEYS=y
 CONFIG_LIRC=y
 CONFIG_LWTUNNEL=y
+CONFIG_MEMCG=y
 CONFIG_MODULE_SIG=y
 CONFIG_MODULE_SRCVERSION_ALL=y
 CONFIG_MODULE_UNLOAD=y
@@ -114,6 +116,7 @@ CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
 CONFIG_SYN_COOKIES=y
 CONFIG_TEST_BPF=m
+CONFIG_TRANSPARENT_HUGEPAGE=y
 CONFIG_UDMABUF=y
 CONFIG_USERFAULTFD=y
 CONFIG_VSOCKETS=y
diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
new file mode 100644
index 000000000000..a4a34ee28301
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -0,0 +1,254 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <math.h>
+#include <sys/mman.h>
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "test_thp_adjust.skel.h"
+
+#define LEN (16 * 1024 * 1024) /* 16MB */
+#define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
+#define PMD_SIZE_FILE "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
+
+static struct test_thp_adjust *skel;
+static char *thp_addr, old_mode[32];
+static long pagesize;
+
+static int thp_mode_save(void)
+{
+	const char *start, *end;
+	char buf[128];
+	int fd, err;
+	size_t len;
+
+	fd = open(THP_ENABLED_FILE, O_RDONLY);
+	if (fd == -1)
+		return -1;
+
+	err = read(fd, buf, sizeof(buf) - 1);
+	if (err == -1)
+		goto close;
+
+	start = strchr(buf, '[');
+	end = start ? strchr(start, ']') : NULL;
+	if (!start || !end || end <= start) {
+		err = -1;
+		goto close;
+	}
+
+	len = end - start - 1;
+	if (len >= sizeof(old_mode))
+		len = sizeof(old_mode) - 1;
+	strncpy(old_mode, start + 1, len);
+	old_mode[len] = '\0';
+
+close:
+	close(fd);
+	return err;
+}
+
+static int thp_mode_set(const char *desired_mode)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_RDWR);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, desired_mode, strlen(desired_mode));
+	close(fd);
+	return err;
+}
+
+static int thp_mode_reset(void)
+{
+	int fd, err;
+
+	fd = open(THP_ENABLED_FILE, O_WRONLY);
+	if (fd == -1)
+		return -1;
+
+	err = write(fd, old_mode, strlen(old_mode));
+	close(fd);
+	return err;
+}
+
+static int thp_alloc(void)
+{
+	int err, i;
+
+	thp_addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
+	if (thp_addr == MAP_FAILED)
+		return -1;
+
+	err = madvise(thp_addr, LEN, MADV_HUGEPAGE);
+	if (err == -1)
+		goto unmap;
+
+	/* Accessing a single byte within a page is sufficient to trigger a page fault. */
+	for (i = 0; i < LEN; i += pagesize)
+		thp_addr[i] = 1;
+	return 0;
+
+unmap:
+	munmap(thp_addr, LEN);
+	return -1;
+}
+
+static void thp_free(void)
+{
+	if (!thp_addr)
+		return;
+	munmap(thp_addr, LEN);
+}
+
+static int get_pmd_order(void)
+{
+	ssize_t bytes_read, size;
+	int fd, order, ret = -1;
+	char buf[64], *endptr;
+
+	fd = open(PMD_SIZE_FILE, O_RDONLY);
+	if (fd < 0)
+		return -1;
+
+	bytes_read = read(fd, buf, sizeof(buf) - 1);
+	if (bytes_read <= 0)
+		goto close_fd;
+
+	/* Remove potential newline character */
+	if (buf[bytes_read - 1] == '\n')
+		buf[bytes_read - 1] = '\0';
+
+	size = strtoul(buf, &endptr, 10);
+	if (endptr == buf || *endptr != '\0')
+		goto close_fd;
+	if (size % pagesize != 0)
+		goto close_fd;
+	order = size / pagesize;
+	/* hpage_pmd_size must be a power-of-two multiple of the page size. */
+	if (!order || (order & (order - 1)))
+		goto close_fd;
+	ret = 0;
+	while (order > 1) {
+		order >>= 1;
+		ret++;
+	}
+
+close_fd:
+	close(fd);
+	return ret;
+}
+
+static void subtest_thp_policy(void)
+{
+	struct bpf_link *fentry_link, *ops_link;
+
+	/* After attaching the struct_ops, THP is allocated only via khugepaged. */
+	ops_link = bpf_map__attach_struct_ops(skel->maps.khugepaged_ops);
+	if (!ASSERT_OK_PTR(ops_link, "attach struct_ops"))
+		return;
+
+	/* Create a new BPF program to detect the result. */
+	fentry_link = bpf_program__attach_trace(skel->progs.thp_run);
+	if (!ASSERT_OK_PTR(fentry_link, "attach fentry"))
+		goto detach_ops;
+	if (!ASSERT_NEQ(thp_alloc(), -1, "THP alloc"))
+		goto detach;
+
+	if (!ASSERT_EQ(skel->bss->pf_alloc, 0, "alloc_in_pf"))
+		goto thp_free;
+	if (!ASSERT_GT(skel->bss->pf_disallow, 0, "disallow_in_pf"))
+		goto thp_free;
+
+	ASSERT_EQ(skel->bss->khugepaged_disallow, 0, "disallow_in_khugepaged");
+thp_free:
+	thp_free();
+detach:
+	bpf_link__destroy(fentry_link);
+detach_ops:
+	bpf_link__destroy(ops_link);
+}
+
+static int thp_adjust_setup(void)
+{
+	int err, cgrp_fd, cgrp_id, pmd_order;
+
+	pagesize = sysconf(_SC_PAGESIZE);
+	pmd_order = get_pmd_order();
+	if (!ASSERT_NEQ(pmd_order, -1, "get_pmd_order"))
+		return -1;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "cgrp_env_setup"))
+		return -1;
+
+	cgrp_fd = create_and_get_cgroup("thp_adjust");
+	if (!ASSERT_GE(cgrp_fd, 0, "create_and_get_cgroup"))
+		goto cleanup;
+	close(cgrp_fd);
+
+	err = join_cgroup("thp_adjust");
+	if (!ASSERT_OK(err, "join_cgroup"))
+		goto remove_cgrp;
+
+	err = -1;
+	cgrp_id = get_cgroup_id("thp_adjust");
+	if (!ASSERT_GE(cgrp_id, 0, "get_cgroup_id"))
+		goto join_root;
+
+	if (!ASSERT_NEQ(thp_mode_save(), -1, "THP mode save"))
+		goto join_root;
+	if (!ASSERT_GE(thp_mode_set("madvise"), 0, "THP mode set"))
+		goto join_root;
+
+	skel = test_thp_adjust__open();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		goto thp_reset;
+
+	skel->bss->cgrp_id = cgrp_id;
+	skel->bss->pmd_order = pmd_order;
+
+	err = test_thp_adjust__load(skel);
+	if (!ASSERT_OK(err, "load"))
+		goto destroy;
+	return 0;
+
+destroy:
+	test_thp_adjust__destroy(skel);
+thp_reset:
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+join_root:
+	/* We must join the root cgroup before removing the created cgroup. */
+	err = join_root_cgroup();
+	ASSERT_OK(err, "join_cgroup to root");
+remove_cgrp:
+	remove_cgroup("thp_adjust");
+cleanup:
+	cleanup_cgroup_environment();
+	return err;
+}
+
+static void thp_adjust_destroy(void)
+{
+	int err;
+
+	test_thp_adjust__destroy(skel);
+	ASSERT_GE(thp_mode_reset(), 0, "THP mode reset");
+	err = join_root_cgroup();
+	ASSERT_OK(err, "join_cgroup to root");
+	if (!err)
+		remove_cgroup("thp_adjust");
+	cleanup_cgroup_environment();
+}
+
+void test_thp_adjust(void)
+{
+	if (thp_adjust_setup() == -1)
+		return;
+
+	if (test__start_subtest("alloc_in_khugepaged"))
+		subtest_thp_policy();
+
+	thp_adjust_destroy();
+}
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
new file mode 100644
index 000000000000..93c7927e827a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
+long bpf_task_under_cgroup(struct task_struct *task, struct cgroup *ancestor) __ksym;
+void bpf_cgroup_release(struct cgroup *p) __ksym;
+struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+
+int pf_alloc, pf_disallow, khugepaged_disallow;
+struct mm_struct *target_mm;
+int pmd_order, cgrp_id;
+
+/* Detecting whether a task can successfully allocate THP is unreliable because
+ * it may be influenced by system memory pressure. Instead of making the result
+ * dependent on unpredictable factors, we should simply check
+ * bpf_hook_thp_get_orders()'s return value, which is deterministic.
+ */
+SEC("fexit/bpf_hook_thp_get_orders")
+int BPF_PROG(thp_run, struct vm_area_struct *vma, u64 vma_flags, enum tva_type tva_type,
+	     unsigned long orders, int retval)
+{
+	struct mm_struct *mm = vma->vm_mm;
+
+	if (mm != target_mm)
+		return 0;
+
+	if (orders != (1 << pmd_order))
+		return 0;
+
+	if (tva_type == TVA_PAGEFAULT) {
+		if (retval == (1 << pmd_order))
+			pf_alloc++;
+		else if (retval == (1 << 0))
+			pf_disallow++;
+	} else if (tva_type == TVA_KHUGEPAGED) {
+		/* khugepaged is not triggered immediately, so its allocation
+		 * counts are unreliable.
+		 */
+		if (retval == (1 << 0))
+			khugepaged_disallow++;
+	}
+	return 0;
+}
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(alloc_in_khugepaged, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
+	     enum tva_type tva_type, unsigned long orders)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct task_struct *p, *acquired;
+	int suggested_order = 0;
+	struct cgroup *cgrp;
+
+	if (orders != (1 << pmd_order))
+		return 0;
+
+	if (!mm)
+		return 0;
+
+	/* This BPF hook is already under RCU */
+	p = mm->owner;
+	if (!p)
+		return 0;
+
+	acquired = bpf_task_acquire(p);
+	if (!acquired)
+		return 0;
+
+	cgrp = bpf_cgroup_from_id(cgrp_id);
+	if (!cgrp) {
+		bpf_task_release(acquired);
+		return 0;
+	}
+
+	if (bpf_task_under_cgroup(acquired, cgrp)) {
+		if (!target_mm)
+			target_mm = mm;
+
+		/* BPF THP allocation policy:
+		 * - Allow PMD allocation in khugepaged only
+		 * - "THPeligible" in /proc/<pid>/smaps is also set
+		 */
+		if (tva_type == TVA_KHUGEPAGED || tva_type == TVA_SMAPS)
+			suggested_order = pmd_order;
+	}
+	bpf_cgroup_release(cgrp);
+	bpf_task_release(acquired);
+	return suggested_order;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops khugepaged_ops = {
+	.thp_get_order = (void *)alloc_in_khugepaged,
+};
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v7 mm-new 08/10] selftests/bpf: add test case to update THP policy
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (6 preceding siblings ...)
  2025-09-10  2:44 ` [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-10  2:44 ` [PATCH v7 mm-new 09/10] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

EBUSY is returned when attempting to install a new BPF program while one is
already running, though updates to existing programs are permitted.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/thp_adjust.c     | 23 +++++++++++++++++++
 .../selftests/bpf/progs/test_thp_adjust.c     | 14 +++++++++++
 2 files changed, 37 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index a4a34ee28301..30172f2ee5d5 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -170,6 +170,27 @@ static void subtest_thp_policy(void)
 	bpf_link__destroy(ops_link);
 }
 
+static void subtest_thp_policy_update(void)
+{
+	struct bpf_link *old_link, *new_link;
+	int err;
+
+	old_link = bpf_map__attach_struct_ops(skel->maps.swap_ops);
+	if (!ASSERT_OK_PTR(old_link, "attach_old_link"))
+		return;
+
+	new_link = bpf_map__attach_struct_ops(skel->maps.khugepaged_ops);
+	if (!ASSERT_NULL(new_link, "attach_new_link"))
+		goto destroy_old;
+	ASSERT_EQ(errno, EBUSY, "attach_new_link");
+
+	err = bpf_link__update_map(old_link, skel->maps.khugepaged_ops);
+	ASSERT_EQ(err, 0, "update_old_link");
+
+destroy_old:
+	bpf_link__destroy(old_link);
+}
+
 static int thp_adjust_setup(void)
 {
 	int err, cgrp_fd, cgrp_id, pmd_order;
@@ -249,6 +270,8 @@ void test_thp_adjust(void)
 
 	if (test__start_subtest("alloc_in_khugepaged"))
 		subtest_thp_policy();
+	if (test__start_subtest("policy_update"))
+		subtest_thp_policy_update();
 
 	thp_adjust_destroy();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust.c b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
index 93c7927e827a..175e65c5899f 100644
--- a/tools/testing/selftests/bpf/progs/test_thp_adjust.c
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust.c
@@ -98,3 +98,17 @@ SEC(".struct_ops.link")
 struct bpf_thp_ops khugepaged_ops = {
 	.thp_get_order = (void *)alloc_in_khugepaged,
 };
+
+SEC("struct_ops/thp_get_order")
+int BPF_PROG(alloc_not_in_swap, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
+	     enum tva_type tva_type, unsigned long orders)
+{
+	if (tva_type == TVA_SWAP)
+		return 0;
+	return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops swap_ops = {
+	.thp_get_order = (void *)alloc_not_in_swap,
+};
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v7 mm-new 09/10] selftests/bpf: add test cases for invalid thp_adjust usage
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (7 preceding siblings ...)
  2025-09-10  2:44 ` [PATCH v7 mm-new 08/10] selftests/bpf: add test case to update " Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-10  2:44 ` [PATCH v7 mm-new 10/10] Documentation: add BPF-based THP policy management Yafang Shao
  2025-09-10 11:11 ` [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Lance Yang
  10 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

1. The trusted vma->vm_mm pointer can be null and must be checked before
   dereferencing.
2. The trusted mm->owner pointer can be null and must be checked before
   dereferencing.
3. Sleepable programs are prohibited because the call site operates under
   RCU protection.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 .../selftests/bpf/prog_tests/thp_adjust.c     |  7 +++++
 .../bpf/progs/test_thp_adjust_sleepable.c     | 22 ++++++++++++++
 .../bpf/progs/test_thp_adjust_trusted_owner.c | 30 +++++++++++++++++++
 .../bpf/progs/test_thp_adjust_trusted_vma.c   | 27 +++++++++++++++++
 4 files changed, 86 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c

diff --git a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
index 30172f2ee5d5..bbe1a82345ef 100644
--- a/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
+++ b/tools/testing/selftests/bpf/prog_tests/thp_adjust.c
@@ -5,6 +5,9 @@
 #include <test_progs.h>
 #include "cgroup_helpers.h"
 #include "test_thp_adjust.skel.h"
+#include "test_thp_adjust_sleepable.skel.h"
+#include "test_thp_adjust_trusted_vma.skel.h"
+#include "test_thp_adjust_trusted_owner.skel.h"
 
 #define LEN (16 * 1024 * 1024) /* 16MB */
 #define THP_ENABLED_FILE "/sys/kernel/mm/transparent_hugepage/enabled"
@@ -274,4 +277,8 @@ void test_thp_adjust(void)
 		subtest_thp_policy_update();
 
 	thp_adjust_destroy();
+
+	RUN_TESTS(test_thp_adjust_trusted_vma);
+	RUN_TESTS(test_thp_adjust_trusted_owner);
+	RUN_TESTS(test_thp_adjust_sleepable);
 }
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
new file mode 100644
index 000000000000..9b92359f9789
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops.s/thp_get_order")
+__failure __msg("attach to unsupported member thp_get_order of struct bpf_thp_ops")
+int BPF_PROG(thp_sleepable, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
+	     enum tva_type tva_type, unsigned long orders)
+{
+	return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops vma_ops = {
+	.thp_get_order = (void *)thp_sleepable,
+};
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
new file mode 100644
index 000000000000..b3f98c2a9b43
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
@@ -0,0 +1,30 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/thp_get_order")
+__failure __msg("R3 pointer arithmetic on rcu_ptr_or_null_ prohibited, null-check it first")
+int BPF_PROG(thp_trusted_owner, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
+	     enum tva_type tva_type, unsigned long orders)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct task_struct *p;
+
+	if (!mm)
+		return 0;
+
+	p = mm->owner;
+	bpf_printk("The task name is %s\n", p->comm);
+	return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops vma_ops = {
+	.thp_get_order = (void *)thp_trusted_owner,
+};
diff --git a/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
new file mode 100644
index 000000000000..5ce100348714
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+#include "bpf_misc.h"
+
+char _license[] SEC("license") = "GPL";
+
+SEC("struct_ops/thp_get_order")
+__failure __msg("R1 invalid mem access 'trusted_ptr_or_null_'")
+int BPF_PROG(thp_trusted_vma, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
+	     enum tva_type tva_type, unsigned long orders)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct task_struct *p = mm->owner;
+
+	if (!p)
+		return 0;
+	return -1;
+}
+
+SEC(".struct_ops.link")
+struct bpf_thp_ops vma_ops = {
+	.thp_get_order = (void *)thp_trusted_vma,
+};
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH v7 mm-new 10/10] Documentation: add BPF-based THP policy management
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (8 preceding siblings ...)
  2025-09-10  2:44 ` [PATCH v7 mm-new 09/10] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
@ 2025-09-10  2:44 ` Yafang Shao
  2025-09-10 11:11 ` [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Lance Yang
  10 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  2:44 UTC (permalink / raw)
  To: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt
  Cc: bpf, linux-mm, linux-doc, Yafang Shao

Add admin-guide documentation for the BPF-based THP adjustment interface.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 46 ++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 1654211cc6cf..1e072eaacf65 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -738,3 +738,49 @@ support enabled just fine as always. No difference can be noted in
 hugetlbfs other than there will be less overall fragmentation. All
 usual features belonging to hugetlbfs are preserved and
 unaffected. libhugetlbfs will also work fine as usual.
+
+BPF-based THP adjustment
+========================
+
+Overview
+--------
+
+When the system is configured with "always" or "madvise" THP mode, a BPF program
+can be used to adjust THP allocation policies dynamically. This enables
+fine-grained control over THP decisions based on various factors including
+workload identity, allocation context, and system memory pressure.
+
+Program Interface
+-----------------
+
+This feature implements a struct_ops BPF program with the following interface::
+
+  int thp_get_order(struct vm_area_struct *vma,
+                    enum bpf_thp_vma_type vma_type,
+                    enum tva_type tva_type,
+                    unsigned long orders);
+
+Parameters::
+
+  @vma: vm_area_struct associated with the THP allocation
+  @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set,
+             BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE
+             if neither is set.
+  @tva_type: TVA type for current @vma
+  @orders: Bitmask of requested THP orders for this allocation
+           - PMD-mapped allocation if PMD_ORDER is set
+           - mTHP allocation otherwise
+
+Return value::
+
+  The suggested THP order from the BPF program for allocation. It will not
+  exceed the highest requested order in @orders. Return -1 to indicate that the
+  original requested @orders should remain unchanged.
+
+Implementation Notes
+--------------------
+
+This is currently an experimental feature.
+CONFIG_BPF_GET_THP_ORDER must be enabled to use it.
+Only one BPF program can be attached at a time, but the program can be updated
+dynamically to adjust policies without requiring affected tasks to be restarted.
-- 
2.47.3



^ permalink raw reply related	[flat|nested] 61+ messages in thread
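The return-value contract documented in the patch above (the suggested order never exceeds the highest order set in @orders, and -1 keeps @orders unchanged) can be sketched in plain C. The helper names below are illustrative only and are not part of the proposed kernel interface:

```c
#include <assert.h>

/* Illustrative sketch of the documented return-value contract: a
 * policy's suggested order is honored only when it does not exceed the
 * highest order set in the requested @orders bitmask; a return of -1
 * leaves the requested orders untouched. All names are hypothetical. */
static int highest_order(unsigned long orders)
{
	int order = -1;

	while (orders) {	/* index of the top set bit */
		order++;
		orders >>= 1;
	}
	return order;
}

static unsigned long apply_suggested_order(unsigned long orders, int suggested)
{
	if (suggested < 0)
		return orders;			/* -1: keep @orders as-is */
	if (suggested > highest_order(orders))
		return orders;			/* out of range: ignore */
	return 1UL << suggested;		/* narrow to the single order */
}
```

For example, with a PMD-order-only request (order 9), a suggestion of 4 narrows the allocation to an mTHP order, while a suggestion above the highest requested order is ignored.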

* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-10  2:44 ` [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot Yafang Shao
@ 2025-09-10  5:11   ` Lance Yang
  2025-09-10  6:17     ` Yafang Shao
  2025-09-10  7:21   ` Lance Yang
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 61+ messages in thread
From: Lance Yang @ 2025-09-10  5:11 UTC (permalink / raw)
  To: Yafang Shao
  Cc: bpf, linux-mm, linux-doc, Lance Yang, david, akpm, baolin.wang,
	ziy, hannes, corbet, ameryhung, 21cnbao, shakeel.butt, rientjes,
	andrii, daniel, ast, ryan.roberts, gutierrez.asier, willy,
	usamaarif642, lorenzo.stoakes, npache, dev.jain, Liam.Howlett

Hey Yafang,

On 2025/9/10 10:44, Yafang Shao wrote:
> Since a task with MMF_DISABLE_THP_COMPLETELY cannot use THP, remove it from
> the khugepaged_mm_slot to stop khugepaged from processing it.
> 
> After this change, the following semantic relationship always holds:
> 
>    MMF_VM_HUGEPAGE is set     == task is in khugepaged mm_slot
>    MMF_VM_HUGEPAGE is not set == task is not in khugepaged mm_slot
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Lance Yang <ioworker0@gmail.com>
> ---
>   include/linux/khugepaged.h |  1 +
>   kernel/sys.c               |  6 ++++++
>   mm/khugepaged.c            | 19 +++++++++----------
>   3 files changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index eb1946a70cff..6cb9107f1006 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -19,6 +19,7 @@ extern void khugepaged_min_free_kbytes_update(void);
>   extern bool current_is_khugepaged(void);
>   extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   				   bool install_pmd);
> +bool hugepage_pmd_enabled(void);
>   
>   static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>   {
> diff --git a/kernel/sys.c b/kernel/sys.c
> index a46d9b75880b..a1c1e8007f2d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -8,6 +8,7 @@
>   #include <linux/export.h>
>   #include <linux/mm.h>
>   #include <linux/mm_inline.h>
> +#include <linux/khugepaged.h>
>   #include <linux/utsname.h>
>   #include <linux/mman.h>
>   #include <linux/reboot.h>
> @@ -2493,6 +2494,11 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
>   		mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
>   		mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>   	}
> +
> +	if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
> +	    !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> +	    hugepage_pmd_enabled())
> +		__khugepaged_enter(mm);
>   	mmap_write_unlock(current->mm);

One minor style suggestion for prctl_set_thp_disable():

static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
				 unsigned long arg4, unsigned long arg5)
{
	struct mm_struct *mm = current->mm;

	[...]
	if (mmap_write_lock_killable(current->mm))
		return -EINTR;
	[...]
	mmap_write_unlock(current->mm);
	return 0;
}

It initializes struct mm_struct *mm = current->mm; at the beginning, but 
then uses both mm and current->mm. Could you change the calls using
current->mm to use the local mm variable for consistency? Just a nit ;)

Cheers,
Lance
>   	return 0;
>   }
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4ec324a4c1fe..88ac482fb3a0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -413,7 +413,7 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
>   		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>   }
>   
> -static bool hugepage_pmd_enabled(void)
> +bool hugepage_pmd_enabled(void)
>   {
>   	/*
>   	 * We cover the anon, shmem and the file-backed case here; file-backed
> @@ -445,6 +445,7 @@ void __khugepaged_enter(struct mm_struct *mm)
>   
>   	/* __khugepaged_exit() must not run from under us */
>   	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> +	WARN_ON_ONCE(mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm));
>   	if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
>   		return;
>   
> @@ -472,7 +473,8 @@ void __khugepaged_enter(struct mm_struct *mm)
>   void khugepaged_enter_vma(struct vm_area_struct *vma,
>   			  vm_flags_t vm_flags)
>   {
> -	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> +	if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, vma->vm_mm) &&
> +	    !mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>   	    hugepage_pmd_enabled()) {
>   		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
>   			__khugepaged_enter(vma->vm_mm);
> @@ -1451,16 +1453,13 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>   
>   	lockdep_assert_held(&khugepaged_mm_lock);
>   
> -	if (hpage_collapse_test_exit(mm)) {
> +	if (hpage_collapse_test_exit_or_disable(mm)) {
>   		/* free mm_slot */
>   		hash_del(&slot->hash);
>   		list_del(&slot->mm_node);
>   
> -		/*
> -		 * Not strictly needed because the mm exited already.
> -		 *
> -		 * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> -		 */
> +		/* If the mm is disabled, this flag must be cleared. */
> +		mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>   
>   		/* khugepaged_mm_lock actually not necessary for the below */
>   		mm_slot_free(mm_slot_cache, mm_slot);
> @@ -2507,9 +2506,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>   	VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
>   	/*
>   	 * Release the current mm_slot if this mm is about to die, or
> -	 * if we scanned all vmas of this mm.
> +	 * if we scanned all vmas of this mm, or if this mm is disabled.
>   	 */
> -	if (hpage_collapse_test_exit(mm) || !vma) {
> +	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
>   		/*
>   		 * Make sure that if mm_users is reaching zero while
>   		 * khugepaged runs here, khugepaged_exit will find



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-10  5:11   ` Lance Yang
@ 2025-09-10  6:17     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-10  6:17 UTC (permalink / raw)
  To: Lance Yang, Andrew Morton, David Hildenbrand
  Cc: bpf, linux-mm, linux-doc, Lance Yang, baolin.wang, ziy, hannes,
	corbet, ameryhung, 21cnbao, shakeel.butt, rientjes, andrii,
	daniel, ast, ryan.roberts, gutierrez.asier, willy, usamaarif642,
	lorenzo.stoakes, npache, dev.jain, Liam.Howlett

On Wed, Sep 10, 2025 at 1:11 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> Hey Yafang,
>
> On 2025/9/10 10:44, Yafang Shao wrote:
> > Since a task with MMF_DISABLE_THP_COMPLETELY cannot use THP, remove it from
> > the khugepaged_mm_slot to stop khugepaged from processing it.
> >
> > After this change, the following semantic relationship always holds:
> >
> >    MMF_VM_HUGEPAGE is set     == task is in khugepaged mm_slot
> >    MMF_VM_HUGEPAGE is not set == task is not in khugepaged mm_slot
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Lance Yang <ioworker0@gmail.com>
> > ---
> >   include/linux/khugepaged.h |  1 +
> >   kernel/sys.c               |  6 ++++++
> >   mm/khugepaged.c            | 19 +++++++++----------
> >   3 files changed, 16 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index eb1946a70cff..6cb9107f1006 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -19,6 +19,7 @@ extern void khugepaged_min_free_kbytes_update(void);
> >   extern bool current_is_khugepaged(void);
> >   extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >                                  bool install_pmd);
> > +bool hugepage_pmd_enabled(void);
> >
> >   static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >   {
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index a46d9b75880b..a1c1e8007f2d 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -8,6 +8,7 @@
> >   #include <linux/export.h>
> >   #include <linux/mm.h>
> >   #include <linux/mm_inline.h>
> > +#include <linux/khugepaged.h>
> >   #include <linux/utsname.h>
> >   #include <linux/mman.h>
> >   #include <linux/reboot.h>
> > @@ -2493,6 +2494,11 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
> >               mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
> >               mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> >       }
> > +
> > +     if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
> > +         !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> > +         hugepage_pmd_enabled())
> > +             __khugepaged_enter(mm);
> >       mmap_write_unlock(current->mm);
>
> One minor style suggestion for prctl_set_thp_disable():
>
> static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
>                                  unsigned long arg4, unsigned long arg5)
> {
>         struct mm_struct *mm = current->mm;
>
>         [...]
>         if (mmap_write_lock_killable(current->mm))
>                 return -EINTR;
>         [...]
>         mmap_write_unlock(current->mm);
>         return 0;
> }
>
> It initializes struct mm_struct *mm = current->mm; at the beginning, but
> then uses both mm and current->mm. Could you change the calls using
> current->mm to use the local mm variable for consistency? Just a nit ;)

Nice catch

Hello Andrew, David,

The original commit "prctl: extend PR_SET_THP_DISABLE to optionally
exclude VM_HUGEPAGE" is still in mm-new branch. The change below is a
minor cleanup for it.

Could we please fold this change directly into the original commit to
keep the history clean?

diff --git a/kernel/sys.c b/kernel/sys.c
index a46d9b7..2250a32 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2479,7 +2479,7 @@ static int prctl_set_thp_disable(bool
thp_disable, unsigned long flags,
        /* Flags are only allowed when disabling. */
        if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
                return -EINVAL;
-       if (mmap_write_lock_killable(current->mm))
+       if (mmap_write_lock_killable(mm))
                return -EINTR;
        if (thp_disable) {
                if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
@@ -2493,7 +2493,7 @@ static int prctl_set_thp_disable(bool
thp_disable, unsigned long flags,
                mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
                mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
        }
-       mmap_write_unlock(current->mm);
+       mmap_write_unlock(mm);
        return 0;
 }


-- 
Regards
Yafang


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-10  2:44 ` [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot Yafang Shao
  2025-09-10  5:11   ` Lance Yang
@ 2025-09-10  7:21   ` Lance Yang
  2025-09-10 17:27   ` kernel test robot
  2025-09-11 13:43   ` Lorenzo Stoakes
  3 siblings, 0 replies; 61+ messages in thread
From: Lance Yang @ 2025-09-10  7:21 UTC (permalink / raw)
  To: Yafang Shao
  Cc: bpf, linux-mm, linux-doc, Lance Yang, shakeel.butt, rientjes, ast,
	gutierrez.asier, 21cnbao, daniel, ameryhung, corbet, andrii,
	willy, usamaarif642, hannes, dev.jain, baolin.wang, ziy,
	lorenzo.stoakes, david, Liam.Howlett, ryan.roberts, npache, akpm



On 2025/9/10 10:44, Yafang Shao wrote:
> Since a task with MMF_DISABLE_THP_COMPLETELY cannot use THP, remove it from
> the khugepaged_mm_slot to stop khugepaged from processing it.
> 
> After this change, the following semantic relationship always holds:
> 
>    MMF_VM_HUGEPAGE is set     == task is in khugepaged mm_slot
>    MMF_VM_HUGEPAGE is not set == task is not in khugepaged mm_slot
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Acked-by: Lance Yang <lance.yang@linux.dev>

Cheers,
Lance



^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection
  2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
                   ` (9 preceding siblings ...)
  2025-09-10  2:44 ` [PATCH v7 mm-new 10/10] Documentation: add BPF-based THP policy management Yafang Shao
@ 2025-09-10 11:11 ` Lance Yang
  10 siblings, 0 replies; 61+ messages in thread
From: Lance Yang @ 2025-09-10 11:11 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc,
	linux-kernel

Seems like we forgot to CC linux-kernel@vger.kernel.org ;p

On Wed, Sep 10, 2025 at 12:02 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Background
> ==========
>
> Our production servers consistently configure THP to "never" due to
> historical incidents caused by its behavior. Key issues include:
> - Increased Memory Consumption
>   THP significantly raises overall memory usage, reducing available memory
>   for workloads.
>
> - Latency Spikes
>   Random latency spikes occur due to frequent memory compaction triggered
>   by THP.
>
> - Lack of Fine-Grained Control
>   THP tuning is globally configured, making it unsuitable for containerized
>   environments. When multiple workloads share a host, enabling THP without
>   per-workload control leads to unpredictable behavior.
>
> Due to these issues, administrators avoid switching to madvise or always
> modes—unless per-workload THP control is implemented.
>
> To address this, we propose BPF-based THP policy for flexible adjustment.
> Additionally, as David mentioned, this mechanism can also serve as a
> policy prototyping tool (test policies via BPF before upstreaming them).
>
> Proposed Solution
> =================
>
> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook thp_get_order(), allowing BPF programs to
> influence THP order selection based on factors such as:
>
> - Workload identity
>   For example, workloads running in specific containers or cgroups.
> - Allocation context
>   Whether the allocation occurs during a page fault, khugepaged, swap or
>   other paths.
> - VMA's memory advice settings
>   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> - Memory pressure
>   PSI system data or associated cgroup PSI metrics
>
> The new interface for the BPF program is as follows:
>
> /**
>  * @thp_get_order: Get the suggested THP orders from a BPF program for allocation
>  * @vma: vm_area_struct associated with the THP allocation
>  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set,
>  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE
>  *            if neither is set.
>  * @tva_type: TVA type for current @vma
>  * @orders: Bitmask of requested THP orders for this allocation
>  *          - PMD-mapped allocation if PMD_ORDER is set
>  *          - mTHP allocation otherwise
>  *
>  * Return: The suggested THP order from the BPF program for allocation. It will
>  *         not exceed the highest requested order in @orders. Return -1 to
>  *         indicate that the original requested @orders should remain unchanged.
>  */
>
> int thp_get_order(struct vm_area_struct *vma,
>                   enum bpf_thp_vma_type vma_type,
>                   enum tva_type tva_type,
>                   unsigned long orders);
>
> Only a single BPF program can be attached at any given time, though it can
> be dynamically updated to adjust the policy. The implementation supports
> anonymous THP, shmem THP, and mTHP, with future extensions planned for
> file-backed THP.
>
> This functionality is only active when system-wide THP is configured to
> madvise or always mode. It remains disabled in never mode. Additionally,
> if THP is explicitly disabled for a specific task via prctl(), this BPF
> functionality will also be unavailable for that task.
>
> **WARNING**
> - This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to
>   be enabled.
> - The interface may change
> - Behavior may differ in future kernel versions
> - We might remove it in the future
>
> Selftests
> =========
>
> BPF CI
> ------
>
> Patch #7: Implements a basic BPF THP policy that restricts THP allocation
>           via khugepaged to tasks within a specified memory cgroup.
> Patch #8: Provides tests for dynamic BPF program updates and replacement.
> Patch #9: Includes negative tests for invalid BPF helper usage, verifying
>           proper verification by the BPF verifier.
>
> Currently, several dependency patches reside in mm-new but haven't been
> merged into bpf-next. To enable BPF CI testing, these dependencies were
> manually applied to bpf-next. All selftests in this series pass
> successfully [0].
>
> Performance Evaluation
> ----------------------
>
> Performance impact was measured given the page fault handler modifications.
> The standard `perf bench mem memset` benchmark was employed to assess page
> fault performance.
>
> Testing was conducted on an AMD EPYC 7W83 64-Core Processor (single NUMA
> node). Due to variance between individual test runs, a script executed
> 10000 iterations to calculate meaningful averages.
>
> - Baseline (without this patch series)
> - With patch series but no BPF program attached
> - With patch series and BPF program attached
>
> The results across three configurations show negligible performance impact:
>
>   Number of runs: 10,000
>   Average throughput: 40-41 GB/sec
>
> Production verification
> -----------------------
>
> We have successfully deployed a variant of this approach across numerous
> Kubernetes production servers. The implementation enables THP for specific
> workloads (such as applications utilizing ZGC [1]) while disabling it for
> others. This selective deployment has operated flawlessly, with no
> regression reports to date.
>
> For ZGC-based applications, our verification demonstrates that shmem THP
> delivers significant improvements:
> - Reduced CPU utilization
> - Lower average latencies
>
> We are continuously extending its support to more workloads, such as
> TCMalloc-based services. [2]
>
> Deployment steps on our production servers are as follows:
>
> 1. Initial Setup:
> - Set THP mode to "never" (disabling THP by default).
> - Attach the BPF program and pin the BPF maps and links.
> - Pinning ensures persistence (like a kernel module), preventing
> disruption under system pressure.
> - A THP whitelist map tracks allowed cgroups (initially empty -> no THP
> allocations).
>
> 2. Enable THP Control:
> - Switch THP mode to "always" or "madvise" (BPF now governs actual allocations).
>
> 3. Dynamic Management:
> - To permit THP for a cgroup, add its ID to the whitelist map.
> - To revoke permission, remove the cgroup ID from the map.
> - The BPF program can be updated live (policy adjustments require no
> task interruption).
>
> 4. To roll back, disable THP and remove this BPF program.
>
> **WARNING**
> Be aware that the maintainers do not suggest this use case, as the BPF hook
> interface is unstable and might be removed from the upstream kernel—unless
> you have your own kernel team to maintain it ;-)
>
> Future work
> ===========
>
> file-backed THP policy
> ----------------------
>
> Based on our validation with production workloads, we observed mixed
> results with XFS large folios (also known as file-backed THP):
>
> - Performance Benefits
>   Some workloads demonstrated significant improvements with XFS large
>   folios enabled
> - Performance Regression
>   Some workloads experienced degradation when using XFS large folios
>
> These results demonstrate that File THP, similar to anonymous THP, requires
> a more granular approach instead of a uniform implementation.
>
> We will extend the BPF-based order selection mechanism to support
> file-backed THP allocation policies.
>
> Hooking fork() with BPF for Task Configuration
> ----------------------------------------------
>
> The current method for controlling a newly fork()-ed task involves calling
> prctl() (e.g., with PR_SET_THP_DISABLE) to set flags in its mm->flags. This
> requires explicit userspace modification.
>
> A more efficient alternative is to implement a new BPF hook within the
> fork() path. This hook would allow a BPF program to set the task's
> mm->flags directly after mm initialization, leveraging BPF helpers for a
> solution that is transparent to userspace. This is particularly valuable in
> data center environments for fleet-wide management.
>
> Link: https://github.com/kernel-patches/bpf/pull/9706 [0]
> Link: https://wiki.openjdk.org/display/zgc/Main#Main-EnablingTr... [1]
> Link: https://google.github.io/tcmalloc/tuning.html#system-level-optimizations [2]
>
> Changes:
> ========
>
> v6->v7:
> Key Changes Implemented Based on Feedback:
> From Lorenzo:
>   - Rename the hook from get_suggested_order() to bpf_hook_get_thp_order().
>   - Rename bpf_thp.c to huge_memory_bpf.c
>   - Focus the current patchset on THP order selection
>   - Add the BPF hook into thp_vma_allowable_orders()
>   - Make the hook VMA-based and remove the mm parameter
>   - Modify the BPF program to return a single order
>   - Stop passing vma_flags directly to BPF programs
>   - Mark vma->vm_mm as trusted_or_null
>   - Change the MAINTAINER file
> From Andrii:
>   - Mark mm->owner as rcu_or_null to avoid introducing new helpers
> From Barry:
>   - Decouple swap from the normal page fault path
> kernel test robot:
>   - Fix a sparse warning
> Shakeel helped clarify the implementation.
>
> RFC v5-> v6: https://lwn.net/Articles/1035116/
> - Code improvement around the RCU usage (Usama)
> - Add selftests for khugepaged fork (Usama)
> - Add performance data for page fault (Usama)
> - Remove the RFC tag
>
> RFC v4->v5: https://lwn.net/Articles/1034265/
> - Add support for vma (David)
> - Add mTHP support in khugepaged (Zi)
> - Use bitmask of all allowed orders instead (Zi)
> - Retrieve the page size and PMD order rather than hardcoding them (Zi)
>
> RFC v3->v4: https://lwn.net/Articles/1031829/
> - Use a new interface get_suggested_order() (David)
> - Mark it as experimental (David, Lorenzo)
> - Code improvement in THP (Usama)
> - Code improvement in BPF struct ops (Amery)
>
> RFC v2->v3: https://lwn.net/Articles/1024545/
> - Finer-grained tuning based on madvise or always mode (David, Lorenzo)
> - Use BPF to write more advanced policies logic (David, Lorenzo)
>
> RFC v1->v2: https://lwn.net/Articles/1021783/
> The main changes are as follows,
> - Use struct_ops instead of fmod_ret (Alexei)
> - Introduce a new THP mode (Johannes)
> - Introduce new helpers for BPF hook (Zi)
> - Refine the commit log
>
> RFC v1: https://lwn.net/Articles/1019290/
>
> Yafang Shao (10):
>   mm: thp: remove disabled task from khugepaged_mm_slot
>   mm: thp: add support for BPF based THP order selection
>   mm: thp: decouple THP allocation between swap and page fault paths
>   mm: thp: enable THP allocation exclusively through khugepaged
>   bpf: mark mm->owner as __safe_rcu_or_null
>   bpf: mark vma->vm_mm as __safe_trusted_or_null
>   selftests/bpf: add a simple BPF based THP policy
>   selftests/bpf: add test case to update THP policy
>   selftests/bpf: add test cases for invalid thp_adjust usage
>   Documentation: add BPF-based THP policy management
>
>  Documentation/admin-guide/mm/transhuge.rst    |  46 +++
>  MAINTAINERS                                   |   3 +
>  include/linux/huge_mm.h                       |  29 +-
>  include/linux/khugepaged.h                    |   1 +
>  kernel/bpf/verifier.c                         |   8 +
>  kernel/sys.c                                  |   6 +
>  mm/Kconfig                                    |  12 +
>  mm/Makefile                                   |   1 +
>  mm/huge_memory.c                              |   3 +-
>  mm/huge_memory_bpf.c                          | 243 +++++++++++++++
>  mm/khugepaged.c                               |  19 +-
>  mm/memory.c                                   |  15 +-
>  tools/testing/selftests/bpf/config            |   3 +
>  .../selftests/bpf/prog_tests/thp_adjust.c     | 284 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/lsm.c       |   8 +-
>  .../selftests/bpf/progs/test_thp_adjust.c     | 114 +++++++
>  .../bpf/progs/test_thp_adjust_sleepable.c     |  22 ++
>  .../bpf/progs/test_thp_adjust_trusted_owner.c |  30 ++
>  .../bpf/progs/test_thp_adjust_trusted_vma.c   |  27 ++
>  19 files changed, 849 insertions(+), 25 deletions(-)
>  create mode 100644 mm/huge_memory_bpf.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_sleepable.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_owner.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_thp_adjust_trusted_vma.c
>
> --
> 2.47.3
>
>


^ permalink raw reply	[flat|nested] 61+ messages in thread
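The @vma_type parameter described in the cover letter above collapses a VMA's madvise hints into a three-valued enum. A minimal sketch of that mapping follows; the flag values are hypothetical placeholders (the real VM_HUGEPAGE/VM_NOHUGEPAGE bit positions are defined in the kernel, not here):

```c
#include <assert.h>

/* Sketch of deriving @vma_type from a VMA's flags, as the cover letter
 * describes. FAKE_* flag values are placeholders for illustration. */
#define FAKE_VM_HUGEPAGE	0x1UL
#define FAKE_VM_NOHUGEPAGE	0x2UL

enum bpf_thp_vma_type {
	BPF_THP_VM_NONE,	/* neither madvise hint is set */
	BPF_THP_VM_HUGEPAGE,	/* VM_HUGEPAGE (MADV_HUGEPAGE) is set */
	BPF_THP_VM_NOHUGEPAGE,	/* VM_NOHUGEPAGE (MADV_NOHUGEPAGE) is set */
};

static enum bpf_thp_vma_type vma_type_of(unsigned long vm_flags)
{
	if (vm_flags & FAKE_VM_HUGEPAGE)
		return BPF_THP_VM_HUGEPAGE;
	if (vm_flags & FAKE_VM_NOHUGEPAGE)
		return BPF_THP_VM_NOHUGEPAGE;
	return BPF_THP_VM_NONE;
}
```

Passing the pre-computed type (rather than raw vma_flags) is what lets the BPF program branch on the advice state without depending on kernel flag layout.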

* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10  2:44 ` [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection Yafang Shao
@ 2025-09-10 12:42   ` Lance Yang
  2025-09-10 12:54     ` Lance Yang
  2025-09-11 14:02     ` Lorenzo Stoakes
  2025-09-11 14:33   ` Lorenzo Stoakes
                     ` (2 subsequent siblings)
  3 siblings, 2 replies; 61+ messages in thread
From: Lance Yang @ 2025-09-10 12:42 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc

Hey Yafang,

On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> programs to influence THP order selection based on factors such as:
> - Workload identity
>   For example, workloads running in specific containers or cgroups.
> - Allocation context
>   Whether the allocation occurs during a page fault, khugepaged, swap or
>   other paths.
> - VMA's memory advice settings
>   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> - Memory pressure
>   PSI system data or associated cgroup PSI metrics
>
> The kernel API of this new BPF hook is as follows,
>
> /**
>  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>  * @vma: vm_area_struct associated with the THP allocation
>  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
>  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
>  *            neither is set.
>  * @tva_type: TVA type for current @vma
>  * @orders: Bitmask of requested THP orders for this allocation
>  *          - PMD-mapped allocation if PMD_ORDER is set
>  *          - mTHP allocation otherwise
>  *
>  * Return: The suggested THP order from the BPF program for allocation. It will
>  *         not exceed the highest requested order in @orders. Return -1 to
>  *         indicate that the original requested @orders should remain unchanged.
>  */
> typedef int thp_order_fn_t(struct vm_area_struct *vma,
>                            enum bpf_thp_vma_type vma_type,
>                            enum tva_type tva_type,
>                            unsigned long orders);
>
> Only a single BPF program can be attached at any given time, though it can
> be dynamically updated to adjust the policy. The implementation supports
> anonymous THP, shmem THP, and mTHP, with future extensions planned for
> file-backed THP.
>
> This functionality is only active when system-wide THP is configured to
> madvise or always mode. It remains disabled in never mode. Additionally,
> if THP is explicitly disabled for a specific task via prctl(), this BPF
> functionality will also be unavailable for that task.
>
> This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> enabled. Note that this capability is currently unstable and may undergo
> significant changes—including potential removal—in future kernel versions.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
[...]
> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> new file mode 100644
> index 000000000000..525ee22ab598
> --- /dev/null
> +++ b/mm/huge_memory_bpf.c
> @@ -0,0 +1,243 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * BPF-based THP policy management
> + *
> + * Author: Yafang Shao <laoar.shao@gmail.com>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +enum bpf_thp_vma_type {
> +       BPF_THP_VM_NONE = 0,
> +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> +};
> +
> +/**
> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> + * @vma: vm_area_struct associated with the THP allocation
> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> + *            neither is set.
> + * @tva_type: TVA type for current @vma
> + * @orders: Bitmask of requested THP orders for this allocation
> + *          - PMD-mapped allocation if PMD_ORDER is set
> + *          - mTHP allocation otherwise
> + *
> + * Return: The suggested THP order from the BPF program for allocation. It will
> + *         not exceed the highest requested order in @orders. Return -1 to
> + *         indicate that the original requested @orders should remain unchanged.

A minor documentation nit: the comment says "Return -1 to indicate that the
original requested @orders should remain unchanged". It might be slightly
clearer to say "Return a negative value to fall back to the original
behavior". This would cover all error codes as well ;)

> + */
> +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> +                          enum bpf_thp_vma_type vma_type,
> +                          enum tva_type tva_type,
> +                          unsigned long orders);

Sorry if I'm missing some context here since I haven't tracked the whole
series closely.

Regarding the return value for thp_order_fn_t: right now it returns a
single int order. I was thinking, what if we let it return an unsigned
long bitmask of orders instead? This seems like it would be more flexible
down the road, especially if we get more mTHP sizes to choose from. It
would also make the API more consistent, as bpf_hook_thp_get_orders()
itself returns an unsigned long ;)

Also, for future extensions, it might be a good idea to add a reserved
flags argument to the thp_order_fn_t signature.

For example thp_order_fn_t(..., unsigned long flags).

This would give us a forward-compatible way to add new semantics later
without breaking the ABI or needing a v2. We could just require it to be
0 for now.

Thanks for the great work!
Lance


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10 12:42   ` Lance Yang
@ 2025-09-10 12:54     ` Lance Yang
  2025-09-10 13:56       ` Lance Yang
  2025-09-11 14:02     ` Lorenzo Stoakes
  1 sibling, 1 reply; 61+ messages in thread
From: Lance Yang @ 2025-09-10 12:54 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc,
	Lance Yang

On Wed, Sep 10, 2025 at 8:42 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> Hey Yafang,
>
> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> >   For example, workloads running in specific containers or cgroups.
> > - Allocation context
> >   Whether the allocation occurs during a page fault, khugepaged, swap or
> >   other paths.
> > - VMA's memory advice settings
> >   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > - Memory pressure
> >   PSI system data or associated cgroup PSI metrics
> >
> > The kernel API of this new BPF hook is as follows,
> >
> > /**
> >  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >  * @vma: vm_area_struct associated with the THP allocation
> >  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >  *            neither is set.
> >  * @tva_type: TVA type for current @vma
> >  * @orders: Bitmask of requested THP orders for this allocation
> >  *          - PMD-mapped allocation if PMD_ORDER is set
> >  *          - mTHP allocation otherwise
> >  *
> >  * Return: The suggested THP order from the BPF program for allocation. It will
> >  *         not exceed the highest requested order in @orders. Return -1 to
> >  *         indicate that the original requested @orders should remain unchanged.
> >  */
> > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >                            enum bpf_thp_vma_type vma_type,
> >                            enum tva_type tva_type,
> >                            unsigned long orders);
> >
> > Only a single BPF program can be attached at any given time, though it can
> > be dynamically updated to adjust the policy. The implementation supports
> > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > file-backed THP.
> >
> > This functionality is only active when system-wide THP is configured to
> > madvise or always mode. It remains disabled in never mode. Additionally,
> > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > functionality will also be unavailable for that task.
> >
> > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > enabled. Note that this capability is currently unstable and may undergo
> > significant changes—including potential removal—in future kernel versions.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> [...]
> > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > new file mode 100644
> > index 000000000000..525ee22ab598
> > --- /dev/null
> > +++ b/mm/huge_memory_bpf.c
> > @@ -0,0 +1,243 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * BPF-based THP policy management
> > + *
> > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +enum bpf_thp_vma_type {
> > +       BPF_THP_VM_NONE = 0,
> > +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > +};
> > +
> > +/**
> > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > + * @vma: vm_area_struct associated with the THP allocation
> > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > + *            neither is set.
> > + * @tva_type: TVA type for current @vma
> > + * @orders: Bitmask of requested THP orders for this allocation
> > + *          - PMD-mapped allocation if PMD_ORDER is set
> > + *          - mTHP allocation otherwise
> > + *
> > + * Return: The suggested THP order from the BPF program for allocation. It will
> > + *         not exceed the highest requested order in @orders. Return -1 to
> > + *         indicate that the original requested @orders should remain unchanged.
>
> A minor documentation nit: the comment says "Return -1 to indicate that the
> original requested @orders should remain unchanged". It might be slightly
> clearer to say "Return a negative value to fall back to the original
> behavior". This would cover all error codes as well ;)
>
> > + */
> > +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > +                          enum bpf_thp_vma_type vma_type,
> > +                          enum tva_type tva_type,
> > +                          unsigned long orders);
>
> Sorry if I'm missing some context here since I haven't tracked the whole
> series closely.
>
> Regarding the return value for thp_order_fn_t: right now it returns a
> single int order. I was thinking, what if we let it return an unsigned
> long bitmask of orders instead? This seems like it would be more flexible
> down the road, especially if we get more mTHP sizes to choose from. It
> would also make the API more consistent, as bpf_hook_thp_get_orders()
> itself returns an unsigned long ;)

I just realized a flaw in my previous suggestion :(

I suggested changing the return type of thp_order_fn_t to unsigned long
for consistency and flexibility. However, I completely overlooked that
this would prevent the BPF program from returning negative error codes ...

Thanks,
Lance

>
> Also, for future extensions, it might be a good idea to add a reserved
> flags argument to the thp_order_fn_t signature.
>
> For example thp_order_fn_t(..., unsigned long flags).
>
> This would give us a forward-compatible way to add new semantics later
> without breaking the ABI or needing a v2. We could just require it to be
> 0 for now.
>
> Thanks for the great work!
> Lance


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10 12:54     ` Lance Yang
@ 2025-09-10 13:56       ` Lance Yang
  2025-09-11  2:48         ` Yafang Shao
  2025-09-11 14:45         ` Lorenzo Stoakes
  0 siblings, 2 replies; 61+ messages in thread
From: Lance Yang @ 2025-09-10 13:56 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc



On 2025/9/10 20:54, Lance Yang wrote:
> On Wed, Sep 10, 2025 at 8:42 PM Lance Yang <lance.yang@linux.dev> wrote:
>>
>> Hey Yafang,
>>
>> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>
>>> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
>>> THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
>>> programs to influence THP order selection based on factors such as:
>>> - Workload identity
>>>    For example, workloads running in specific containers or cgroups.
>>> - Allocation context
>>>    Whether the allocation occurs during a page fault, khugepaged, swap or
>>>    other paths.
>>> - VMA's memory advice settings
>>>    MADV_HUGEPAGE or MADV_NOHUGEPAGE
>>> - Memory pressure
>>>    PSI system data or associated cgroup PSI metrics
>>>
>>> The kernel API of this new BPF hook is as follows,
>>>
>>> /**
>>>   * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>>>   * @vma: vm_area_struct associated with the THP allocation
>>>   * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
>>>   *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
>>>   *            neither is set.
>>>   * @tva_type: TVA type for current @vma
>>>   * @orders: Bitmask of requested THP orders for this allocation
>>>   *          - PMD-mapped allocation if PMD_ORDER is set
>>>   *          - mTHP allocation otherwise
>>>   *
>>>   * Return: The suggested THP order from the BPF program for allocation. It will
>>>   *         not exceed the highest requested order in @orders. Return -1 to
>>>   *         indicate that the original requested @orders should remain unchanged.
>>>   */
>>> typedef int thp_order_fn_t(struct vm_area_struct *vma,
>>>                             enum bpf_thp_vma_type vma_type,
>>>                             enum tva_type tva_type,
>>>                             unsigned long orders);
>>>
>>> Only a single BPF program can be attached at any given time, though it can
>>> be dynamically updated to adjust the policy. The implementation supports
>>> anonymous THP, shmem THP, and mTHP, with future extensions planned for
>>> file-backed THP.
>>>
>>> This functionality is only active when system-wide THP is configured to
>>> madvise or always mode. It remains disabled in never mode. Additionally,
>>> if THP is explicitly disabled for a specific task via prctl(), this BPF
>>> functionality will also be unavailable for that task.
>>>
>>> This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
>>> enabled. Note that this capability is currently unstable and may undergo
>>> significant changes—including potential removal—in future kernel versions.
>>>
>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>>> ---
>> [...]
>>> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
>>> new file mode 100644
>>> index 000000000000..525ee22ab598
>>> --- /dev/null
>>> +++ b/mm/huge_memory_bpf.c
>>> @@ -0,0 +1,243 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * BPF-based THP policy management
>>> + *
>>> + * Author: Yafang Shao <laoar.shao@gmail.com>
>>> + */
>>> +
>>> +#include <linux/bpf.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/huge_mm.h>
>>> +#include <linux/khugepaged.h>
>>> +
>>> +enum bpf_thp_vma_type {
>>> +       BPF_THP_VM_NONE = 0,
>>> +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
>>> +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
>>> +};
>>> +
>>> +/**
>>> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>>> + * @vma: vm_area_struct associated with the THP allocation
>>> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
>>> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
>>> + *            neither is set.
>>> + * @tva_type: TVA type for current @vma
>>> + * @orders: Bitmask of requested THP orders for this allocation
>>> + *          - PMD-mapped allocation if PMD_ORDER is set
>>> + *          - mTHP allocation otherwise
>>> + *
>>> + * Return: The suggested THP order from the BPF program for allocation. It will
>>> + *         not exceed the highest requested order in @orders. Return -1 to
>>> + *         indicate that the original requested @orders should remain unchanged.
>>
>> A minor documentation nit: the comment says "Return -1 to indicate that the
>> original requested @orders should remain unchanged". It might be slightly
>> clearer to say "Return a negative value to fall back to the original
>> behavior". This would cover all error codes as well ;)
>>
>>> + */
>>> +typedef int thp_order_fn_t(struct vm_area_struct *vma,
>>> +                          enum bpf_thp_vma_type vma_type,
>>> +                          enum tva_type tva_type,
>>> +                          unsigned long orders);
>>
>> Sorry if I'm missing some context here since I haven't tracked the whole
>> series closely.
>>
>> Regarding the return value for thp_order_fn_t: right now it returns a
>> single int order. I was thinking, what if we let it return an unsigned
>> long bitmask of orders instead? This seems like it would be more flexible
>> down the road, especially if we get more mTHP sizes to choose from. It
>> would also make the API more consistent, as bpf_hook_thp_get_orders()
>> itself returns an unsigned long ;)
> 
> I just realized a flaw in my previous suggestion :(
> 
> Changing the return type of thp_order_fn_t to unsigned long for consistency
> and flexibility. However, I completely overlooked that this would prevent
> the BPF program from returning negative error codes ...
> 
> Thanks,
> Lance
> 
>>
>> Also, for future extensions, it might be a good idea to add a reserved
>> flags argument to the thp_order_fn_t signature.
>>
>> For example thp_order_fn_t(..., unsigned long flags).
>>
>> This would give us a forward-compatible way to add new semantics later
>> without breaking the ABI or needing a v2. We could just require it to be
>> 0 for now.
>>
>> Thanks for the great work!
>> Lance


Forgot to add:

I noticed that if the hook returns 0, bpf_hook_thp_get_orders() falls
back to 'orders', preventing us from dynamically disabling mTHP
allocations.

Honoring a return of 0 is critical for our use case, which is to
dynamically disable mTHP for low-priority containers when memory gets
low in mixed workloads.

And then re-enable it for them when memory is back above the low
watermark.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-10  2:44 ` [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot Yafang Shao
  2025-09-10  5:11   ` Lance Yang
  2025-09-10  7:21   ` Lance Yang
@ 2025-09-10 17:27   ` kernel test robot
  2025-09-11  2:12     ` Lance Yang
  2025-09-11 13:43   ` Lorenzo Stoakes
  3 siblings, 1 reply; 61+ messages in thread
From: kernel test robot @ 2025-09-10 17:27 UTC (permalink / raw)
  To: Yafang Shao, akpm, david, ziy, baolin.wang, lorenzo.stoakes,
	Liam.Howlett, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, willy, ast, daniel, andrii,
	ameryhung, rientjes, corbet, 21cnbao, shakeel.butt
  Cc: llvm, oe-kbuild-all, bpf, linux-mm, linux-doc, Yafang Shao,
	Lance Yang

Hi Yafang,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-thp-remove-disabled-task-from-khugepaged_mm_slot/20250910-144850
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250910024447.64788-2-laoar.shao%40gmail.com
patch subject: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509110109.PSgSHb31-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/sys.c:2500:6: error: call to undeclared function 'hugepage_pmd_enabled'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2500 |             hugepage_pmd_enabled())
         |             ^
>> kernel/sys.c:2501:3: error: call to undeclared function '__khugepaged_enter'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    2501 |                 __khugepaged_enter(mm);
         |                 ^
   2 errors generated.


vim +/hugepage_pmd_enabled +2500 kernel/sys.c

  2471	
  2472	static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
  2473					 unsigned long arg4, unsigned long arg5)
  2474	{
  2475		struct mm_struct *mm = current->mm;
  2476	
  2477		if (arg4 || arg5)
  2478			return -EINVAL;
  2479	
  2480		/* Flags are only allowed when disabling. */
  2481		if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
  2482			return -EINVAL;
  2483		if (mmap_write_lock_killable(current->mm))
  2484			return -EINTR;
  2485		if (thp_disable) {
  2486			if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
  2487				mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
  2488				mm_flags_set(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
  2489			} else {
  2490				mm_flags_set(MMF_DISABLE_THP_COMPLETELY, mm);
  2491				mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
  2492			}
  2493		} else {
  2494			mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
  2495			mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
  2496		}
  2497	
  2498		if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
  2499		    !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> 2500		    hugepage_pmd_enabled())
> 2501			__khugepaged_enter(mm);
  2502		mmap_write_unlock(current->mm);
  2503		return 0;
  2504	}
  2505	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy
  2025-09-10  2:44 ` [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
@ 2025-09-10 20:44   ` Alexei Starovoitov
  2025-09-11  2:31     ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Alexei Starovoitov @ 2025-09-10 20:44 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	Johannes Weiner, usamaarif642, gutierrez.asier, Matthew Wilcox,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Amery Hung,
	David Rientjes, Jonathan Corbet, 21cnbao, Shakeel Butt, bpf,
	linux-mm, open list:DOCUMENTATION

On Tue, Sep 9, 2025 at 7:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> +/* Detecting whether a task can successfully allocate THP is unreliable because
> + * it may be influenced by system memory pressure. Instead of making the result
> + * dependent on unpredictable factors, we should simply check
> + * bpf_hook_thp_get_orders()'s return value, which is deterministic.
> + */
> +SEC("fexit/bpf_hook_thp_get_orders")
> +int BPF_PROG(thp_run, struct vm_area_struct *vma, u64 vma_flags, enum tva_type tva_type,
> +            unsigned long orders, int retval)
> +{

...

> +SEC("struct_ops/thp_get_order")
> +int BPF_PROG(alloc_in_khugepaged, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
> +            enum tva_type tva_type, unsigned long orders)
> +{

This is a bad idea to mix struct_ops logic with fentry/fexit style.
struct_ops hook will not be affected by compiler optimizations,
while fentry depends on the whims of compilers.
struct_ops can be scoped, while fentry is always global.
sched-ext already struggles with the later, since some scheds
need tracing data from other parts of the kernel and they cannot
be grouped together. All sorts of workarounds were proposed, but
no good solution in sight. So don't go this route for THP.
Make everything you need to be struct_ops based and/or pass
whatever extra data into these ops.

Also think of scoping for bpf-thp from the start.
Currently st_ops/thp_get_order is only one and it's global.
It's ok for prototypes and experiments, but not ok for landing upstream.
I think cgroup would be a natural scope and different cgroups might
want their own bpf based THP hints. Once you do that, think through
how delegation of suggested order will propagate through hierarchy.

bpf-oom seems to be aligning toward the same design principles,
so don't reinvent the wheel.


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-10 17:27   ` kernel test robot
@ 2025-09-11  2:12     ` Lance Yang
  2025-09-11  2:28       ` Zi Yan
  0 siblings, 1 reply; 61+ messages in thread
From: Lance Yang @ 2025-09-11  2:12 UTC (permalink / raw)
  To: Yafang Shao
  Cc: llvm, oe-kbuild-all, bpf, linux-mm, linux-doc, Lance Yang, akpm,
	gutierrez.asier, rientjes, andrii, david, ziy, baolin.wang,
	Liam.Howlett, ameryhung, ryan.roberts, lorenzo.stoakes,
	usamaarif642, willy, corbet, npache, dev.jain, 21cnbao,
	shakeel.butt, ast, daniel, hannes, kernel test robot

Hi Yafang,

On 2025/9/11 01:27, kernel test robot wrote:
> Hi Yafang,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on akpm-mm/mm-everything]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-thp-remove-disabled-task-from-khugepaged_mm_slot/20250910-144850
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20250910024447.64788-2-laoar.shao%40gmail.com
> patch subject: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
> config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202509110109.PSgSHb31-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
>>> kernel/sys.c:2500:6: error: call to undeclared function 'hugepage_pmd_enabled'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
>      2500 |             hugepage_pmd_enabled())
>           |             ^
>>> kernel/sys.c:2501:3: error: call to undeclared function '__khugepaged_enter'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
>      2501 |                 __khugepaged_enter(mm);
>           |                 ^
>     2 errors generated.

Oops, seems like hugepage_pmd_enabled() and __khugepaged_enter() are only
available when CONFIG_TRANSPARENT_HUGEPAGE is enabled ;)

> 
> 
> vim +/hugepage_pmd_enabled +2500 kernel/sys.c
> 
>    2471	
>    2472	static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
>    2473					 unsigned long arg4, unsigned long arg5)
>    2474	{
>    2475		struct mm_struct *mm = current->mm;
>    2476	
>    2477		if (arg4 || arg5)
>    2478			return -EINVAL;
>    2479	
>    2480		/* Flags are only allowed when disabling. */
>    2481		if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
>    2482			return -EINVAL;
>    2483		if (mmap_write_lock_killable(current->mm))
>    2484			return -EINTR;
>    2485		if (thp_disable) {
>    2486			if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
>    2487				mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
>    2488				mm_flags_set(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>    2489			} else {
>    2490				mm_flags_set(MMF_DISABLE_THP_COMPLETELY, mm);
>    2491				mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>    2492			}
>    2493		} else {
>    2494			mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
>    2495			mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>    2496		}
>    2497	
>    2498		if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
>    2499		    !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
>> 2500		    hugepage_pmd_enabled())
>> 2501			__khugepaged_enter(mm);
>    2502		mmap_write_unlock(current->mm);
>    2503		return 0;
>    2504	}
>    2505	

So, let's wrap the new logic in an #ifdef CONFIG_TRANSPARENT_HUGEPAGE block.

diff --git a/kernel/sys.c b/kernel/sys.c
index a1c1e8007f2d..c8600e017933 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2495,10 +2495,13 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
                 mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
         }

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
         if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
             !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
             hugepage_pmd_enabled())
                 __khugepaged_enter(mm);
+#endif
+
         mmap_write_unlock(current->mm);
         return 0;
  }

Cheers,
Lance


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-11  2:12     ` Lance Yang
@ 2025-09-11  2:28       ` Zi Yan
  2025-09-11  2:35         ` Yafang Shao
                           ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Zi Yan @ 2025-09-11  2:28 UTC (permalink / raw)
  To: Lance Yang
  Cc: Yafang Shao, llvm, oe-kbuild-all, bpf, linux-mm, linux-doc,
	Lance Yang, akpm, gutierrez.asier, rientjes, andrii, david,
	baolin.wang, Liam.Howlett, ameryhung, ryan.roberts,
	lorenzo.stoakes, usamaarif642, willy, corbet, npache, dev.jain,
	21cnbao, shakeel.butt, ast, daniel, hannes, kernel test robot

On 10 Sep 2025, at 22:12, Lance Yang wrote:

> Hi Yafang,
>
> On 2025/9/11 01:27, kernel test robot wrote:
>> Hi Yafang,
>>
>> kernel test robot noticed the following build errors:
>>
>> [auto build test ERROR on akpm-mm/mm-everything]
>>
>> url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-thp-remove-disabled-task-from-khugepaged_mm_slot/20250910-144850
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>> patch link:    https://lore.kernel.org/r/20250910024447.64788-2-laoar.shao%40gmail.com
>> patch subject: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
>> config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/config)
>> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
>> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/reproduce)
>>
>> If you fix the issue in a separate patch/commit (i.e. not just a new version of
>> the same patch/commit), kindly add following tags
>> | Reported-by: kernel test robot <lkp@intel.com>
>> | Closes: https://lore.kernel.org/oe-kbuild-all/202509110109.PSgSHb31-lkp@intel.com/
>>
>> All errors (new ones prefixed by >>):
>>
>>>> kernel/sys.c:2500:6: error: call to undeclared function 'hugepage_pmd_enabled'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
>>      2500 |             hugepage_pmd_enabled())
>>           |             ^
>>>> kernel/sys.c:2501:3: error: call to undeclared function '__khugepaged_enter'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
>>      2501 |                 __khugepaged_enter(mm);
>>           |                 ^
>>     2 errors generated.
>
> Oops, seems like hugepage_pmd_enabled() and __khugepaged_enter() are only
> available when CONFIG_TRANSPARENT_HUGEPAGE is enabled ;)
>
>>
>>
>> vim +/hugepage_pmd_enabled +2500 kernel/sys.c
>>
>>    2471	
>>    2472	static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
>>    2473					 unsigned long arg4, unsigned long arg5)
>>    2474	{
>>    2475		struct mm_struct *mm = current->mm;
>>    2476	
>>    2477		if (arg4 || arg5)
>>    2478			return -EINVAL;
>>    2479	
>>    2480		/* Flags are only allowed when disabling. */
>>    2481		if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
>>    2482			return -EINVAL;
>>    2483		if (mmap_write_lock_killable(current->mm))
>>    2484			return -EINTR;
>>    2485		if (thp_disable) {
>>    2486			if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
>>    2487				mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
>>    2488				mm_flags_set(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>>    2489			} else {
>>    2490				mm_flags_set(MMF_DISABLE_THP_COMPLETELY, mm);
>>    2491				mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>>    2492			}
>>    2493		} else {
>>    2494			mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
>>    2495			mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>>    2496		}
>>    2497	
>>    2498		if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
>>    2499		    !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
>>> 2500		    hugepage_pmd_enabled())
>>> 2501			__khugepaged_enter(mm);
>>    2502		mmap_write_unlock(current->mm);
>>    2503		return 0;
>>    2504	}
>>    2505	
>
> So, let's wrap the new logic in an #ifdef CONFIG_TRANSPARENT_HUGEPAGE block.
>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index a1c1e8007f2d..c8600e017933 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2495,10 +2495,13 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
>                 mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>         }
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>         if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
>             !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
>             hugepage_pmd_enabled())
>                 __khugepaged_enter(mm);
> +#endif
> +
>         mmap_write_unlock(current->mm);
>         return 0;
>  }

Or in the header file,

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
...
#else
static inline bool hugepage_pmd_enabled(void)
{
	return false;
}

static inline int __khugepaged_enter(struct mm_struct *mm)
{
	return 0;
}
#endif

Best Regards,
Yan, Zi



* Re: [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy
  2025-09-10 20:44   ` Alexei Starovoitov
@ 2025-09-11  2:31     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-11  2:31 UTC (permalink / raw)
  To: Alexei Starovoitov, Johannes Weiner, Tejun Heo
  Cc: Andrew Morton, David Hildenbrand, ziy, baolin.wang,
	Lorenzo Stoakes, Liam Howlett, npache, ryan.roberts, dev.jain,
	usamaarif642, gutierrez.asier, Matthew Wilcox, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Amery Hung, David Rientjes,
	Jonathan Corbet, 21cnbao, Shakeel Butt, bpf, linux-mm,
	open list:DOCUMENTATION

On Thu, Sep 11, 2025 at 4:44 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Sep 9, 2025 at 7:46 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > +/* Detecting whether a task can successfully allocate THP is unreliable because
> > + * it may be influenced by system memory pressure. Instead of making the result
> > + * dependent on unpredictable factors, we should simply check
> > + * bpf_hook_thp_get_orders()'s return value, which is deterministic.
> > + */
> > +SEC("fexit/bpf_hook_thp_get_orders")
> > +int BPF_PROG(thp_run, struct vm_area_struct *vma, u64 vma_flags, enum tva_type tva_type,
> > +            unsigned long orders, int retval)
> > +{
>
> ...
>
> > +SEC("struct_ops/thp_get_order")
> > +int BPF_PROG(alloc_in_khugepaged, struct vm_area_struct *vma, enum bpf_thp_vma_type vma_type,
> > +            enum tva_type tva_type, unsigned long orders)
> > +{
>
> It is a bad idea to mix struct_ops logic with the fentry/fexit style.
> A struct_ops hook will not be affected by compiler optimizations,
> while fentry depends on the whims of the compiler.
> struct_ops can be scoped, while fentry is always global.
> sched-ext already struggles with the latter, since some scheds
> need tracing data from other parts of the kernel and they cannot
> be grouped together. All sorts of workarounds were proposed, but
> no good solution is in sight. So don't go this route for THP.
> Make everything you need struct_ops based and/or pass
> whatever extra data into these ops.

will change it.

>
> Also think about scoping for bpf-thp from the start.
> Currently st_ops/thp_get_order is the only one and it's global.
> That's ok for prototypes and experiments, but not ok for landing upstream.
> I think cgroup would be a natural scope, and different cgroups might
> want their own bpf-based THP hints. Once you do that, think through
> how delegation of the suggested order will propagate through the hierarchy.

+ Tejun

As Johannes Weiner previously explained [0], cgroups are designed as
nested hierarchies for partitioning resources. They are a poor fit for
enforcing arbitrary, non-hierarchical policies.

: Cgroups are for nested trees dividing up resources. They're not a good
: fit for arbitrary, non-hierarchical policy settings.

[0] https://lore.kernel.org/linux-mm/20250430175954.GD2020@cmpxchg.org/

The THP policy is a quintessential example of such an arbitrary
setting. Even within a single cgroup, it is often necessary to enable
THP for performance-critical tasks while disabling it for others to
avoid latency spikes. Implementing this policy through a cgroup
interface that propagates hierarchically would eliminate the crucial
ability to configure it on a per-task basis.

While the bpf-thp mechanism has a global scope, this does not limit
its application to a single system-wide policy. In contrast to a
hierarchical cgroup-based setting, bpf-thp offers the flexibility to
set policies per-task, per-cgroup, or globally. Fundamentally, it is a
more powerful variant of prctl(), not a variant of a cgroup interface
file.

>
> bpf-oom seems to be aligning toward the same design principles,
> so don't reinvent the wheel.

Since bpf-oom's role is to select a task to kill from within **a
defined group of tasks**, it is inherently well-suited for
cgroup-based management.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-11  2:28       ` Zi Yan
@ 2025-09-11  2:35         ` Yafang Shao
  2025-09-11  2:38         ` Lance Yang
  2025-09-11 13:47         ` Lorenzo Stoakes
  2 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-11  2:35 UTC (permalink / raw)
  To: Zi Yan, Lance Yang
  Cc: llvm, oe-kbuild-all, bpf, linux-mm, linux-doc, Lance Yang, akpm,
	gutierrez.asier, rientjes, andrii, david, baolin.wang,
	Liam.Howlett, ameryhung, ryan.roberts, lorenzo.stoakes,
	usamaarif642, willy, corbet, npache, dev.jain, 21cnbao,
	shakeel.butt, ast, daniel, hannes, kernel test robot

On Thu, Sep 11, 2025 at 10:28 AM Zi Yan <ziy@nvidia.com> wrote:
>
[...]
> Or in the header file,
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> ...
> #else
> bool hugepage_pmd_enabled()
> {
>         return false;
> }
>
> int __khugepaged_enter(struct mm_struct *mm)
> {
>         return 0;
> }
> #endif

Thank you, both. I will address this in the next version.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-11  2:28       ` Zi Yan
  2025-09-11  2:35         ` Yafang Shao
@ 2025-09-11  2:38         ` Lance Yang
  2025-09-11 13:47         ` Lorenzo Stoakes
  2 siblings, 0 replies; 61+ messages in thread
From: Lance Yang @ 2025-09-11  2:38 UTC (permalink / raw)
  To: Zi Yan
  Cc: Yafang Shao, llvm, oe-kbuild-all, bpf, linux-mm, linux-doc,
	Lance Yang, akpm, gutierrez.asier, rientjes, andrii, david,
	baolin.wang, Liam.Howlett, ameryhung, ryan.roberts,
	lorenzo.stoakes, usamaarif642, willy, corbet, npache, dev.jain,
	21cnbao, shakeel.butt, ast, daniel, hannes, kernel test robot



On 2025/9/11 10:28, Zi Yan wrote:
[...]
> 
> Or in the header file,
> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> ...
> #else
> bool hugepage_pmd_enabled()
> {
> 	return false;
> }
> 
> int __khugepaged_enter(struct mm_struct *mm)
> {
> 	return 0;
> }
> #endif

Nice. That's a much better approach.



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10 13:56       ` Lance Yang
@ 2025-09-11  2:48         ` Yafang Shao
  2025-09-11  3:04           ` Lance Yang
  2025-09-11 14:45         ` Lorenzo Stoakes
  1 sibling, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-11  2:48 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 9:57 PM Lance Yang <lance.yang@linux.dev> wrote:
>
>
>
> On 2025/9/10 20:54, Lance Yang wrote:
> > On Wed, Sep 10, 2025 at 8:42 PM Lance Yang <lance.yang@linux.dev> wrote:
> >>
> >> Hey Yafang,
> >>
> >> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>
> >>> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> >>> THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> >>> programs to influence THP order selection based on factors such as:
> >>> - Workload identity
> >>>    For example, workloads running in specific containers or cgroups.
> >>> - Allocation context
> >>>    Whether the allocation occurs during a page fault, khugepaged, swap or
> >>>    other paths.
> >>> - VMA's memory advice settings
> >>>    MADV_HUGEPAGE or MADV_NOHUGEPAGE
> >>> - Memory pressure
> >>>    PSI system data or associated cgroup PSI metrics
> >>>
> >>> The kernel API of this new BPF hook is as follows,
> >>>
> >>> /**
> >>>   * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >>>   * @vma: vm_area_struct associated with the THP allocation
> >>>   * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >>>   *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >>>   *            neither is set.
> >>>   * @tva_type: TVA type for current @vma
> >>>   * @orders: Bitmask of requested THP orders for this allocation
> >>>   *          - PMD-mapped allocation if PMD_ORDER is set
> >>>   *          - mTHP allocation otherwise
> >>>   *
> >>>   * Return: The suggested THP order from the BPF program for allocation. It will
> >>>   *         not exceed the highest requested order in @orders. Return -1 to
> >>>   *         indicate that the original requested @orders should remain unchanged.
> >>>   */
> >>> typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >>>                             enum bpf_thp_vma_type vma_type,
> >>>                             enum tva_type tva_type,
> >>>                             unsigned long orders);
> >>>
> >>> Only a single BPF program can be attached at any given time, though it can
> >>> be dynamically updated to adjust the policy. The implementation supports
> >>> anonymous THP, shmem THP, and mTHP, with future extensions planned for
> >>> file-backed THP.
> >>>
> >>> This functionality is only active when system-wide THP is configured to
> >>> madvise or always mode. It remains disabled in never mode. Additionally,
> >>> if THP is explicitly disabled for a specific task via prctl(), this BPF
> >>> functionality will also be unavailable for that task.
> >>>
> >>> This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> >>> enabled. Note that this capability is currently unstable and may undergo
> >>> significant changes—including potential removal—in future kernel versions.
> >>>
> >>> Suggested-by: David Hildenbrand <david@redhat.com>
> >>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >>> ---
> >> [...]
> >>> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> >>> new file mode 100644
> >>> index 000000000000..525ee22ab598
> >>> --- /dev/null
> >>> +++ b/mm/huge_memory_bpf.c
> >>> @@ -0,0 +1,243 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>> +/*
> >>> + * BPF-based THP policy management
> >>> + *
> >>> + * Author: Yafang Shao <laoar.shao@gmail.com>
> >>> + */
> >>> +
> >>> +#include <linux/bpf.h>
> >>> +#include <linux/btf.h>
> >>> +#include <linux/huge_mm.h>
> >>> +#include <linux/khugepaged.h>
> >>> +
> >>> +enum bpf_thp_vma_type {
> >>> +       BPF_THP_VM_NONE = 0,
> >>> +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> >>> +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> >>> +};
> >>> +
> >>> +/**
> >>> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >>> + * @vma: vm_area_struct associated with the THP allocation
> >>> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >>> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >>> + *            neither is set.
> >>> + * @tva_type: TVA type for current @vma
> >>> + * @orders: Bitmask of requested THP orders for this allocation
> >>> + *          - PMD-mapped allocation if PMD_ORDER is set
> >>> + *          - mTHP allocation otherwise
> >>> + *
> >>> + * Return: The suggested THP order from the BPF program for allocation. It will
> >>> + *         not exceed the highest requested order in @orders. Return -1 to
> >>> + *         indicate that the original requested @orders should remain unchanged.
> >>
> >> A minor documentation nit: the comment says "Return -1 to indicate that the
> >> original requested @orders should remain unchanged". It might be slightly
> >> clearer to say "Return a negative value to fall back to the original
> >> behavior". This would cover all error codes as well ;)

will change it.

> >>
> >>> + */
> >>> +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >>> +                          enum bpf_thp_vma_type vma_type,
> >>> +                          enum tva_type tva_type,
> >>> +                          unsigned long orders);
> >>
> >> Sorry if I'm missing some context here since I haven't tracked the whole
> >> series closely.
> >>
> >> Regarding the return value for thp_order_fn_t: right now it returns a
> >> single int order. I was thinking, what if we let it return an unsigned
> >> long bitmask of orders instead? This seems like it would be more flexible
> >> down the road, especially if we get more mTHP sizes to choose from. It
> >> would also make the API more consistent, as bpf_hook_thp_get_orders()
> >> itself returns an unsigned long ;)
> >
> > I just realized a flaw in my previous suggestion :(
> >
> > Changing the return type of thp_order_fn_t to unsigned long for consistency
> > and flexibility. However, I completely overlooked that this would prevent
> > the BPF program from returning negative error codes ...
> >
> > Thanks,
> > Lance
> >
> >>
> >> Also, for future extensions, it might be a good idea to add a reserved
> >> flags argument to the thp_order_fn_t signature.
> >>
> >> For example thp_order_fn_t(..., unsigned long flags).
> >>
> >> This would give us a forward-compatible way to add new semantics later
> >> without breaking the ABI and needing a v2. We could just require it to be
> >> 0 for now.

That makes sense. However, as Lorenzo mentioned previously, we should
keep the interface as minimal as possible.

> >>
> >> Thanks for the great work!
> >> Lance
>
>
> Forgot to add:
>
> Noticed that if the hook returns 0, bpf_hook_thp_get_orders() falls
> back to 'orders', preventing us from dynamically disabling mTHP
> allocations.

Could you please clarify what you mean by that?

+       thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
+       if (thp_order < 0)
+               goto out;

In my implementation, it only falls back to @orders if the return
value is negative. If the return value is 0, it uses BIT(0):

+       if (thp_order <= highest_order(orders))
+               thp_orders = BIT(thp_order);

>
> Honoring a return of 0 is critical for our use case, which is to
> dynamically disable mTHP for low-priority containers when memory gets
> low in mixed workloads.
>
> And then re-enable it for them when memory is back above the low
> watermark.

Thank you for detailing your use case; that context is very helpful.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-11  2:48         ` Yafang Shao
@ 2025-09-11  3:04           ` Lance Yang
  0 siblings, 0 replies; 61+ messages in thread
From: Lance Yang @ 2025-09-11  3:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc



On 2025/9/11 10:48, Yafang Shao wrote:
> On Wed, Sep 10, 2025 at 9:57 PM Lance Yang <lance.yang@linux.dev> wrote:
[...]
>>>>> +/**
>>>>> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>>>>> + * @vma: vm_area_struct associated with the THP allocation
>>>>> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
>>>>> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
>>>>> + *            neither is set.
>>>>> + * @tva_type: TVA type for current @vma
>>>>> + * @orders: Bitmask of requested THP orders for this allocation
>>>>> + *          - PMD-mapped allocation if PMD_ORDER is set
>>>>> + *          - mTHP allocation otherwise
>>>>> + *
>>>>> + * Return: The suggested THP order from the BPF program for allocation. It will
>>>>> + *         not exceed the highest requested order in @orders. Return -1 to
>>>>> + *         indicate that the original requested @orders should remain unchanged.
>>>>
>>>> A minor documentation nit: the comment says "Return -1 to indicate that the
>>>> original requested @orders should remain unchanged". It might be slightly
>>>> clearer to say "Return a negative value to fall back to the original
>>>> behavior". This would cover all error codes as well ;)
> 
> will change it.

Please feel free to change it ;)

> 
>>>>
[...]
>>>>
>>>> Also, for future extensions, it might be a good idea to add a reserved
>>>> flags argument to the thp_order_fn_t signature.
>>>>
>>>> For example thp_order_fn_t(..., unsigned long flags).
>>>>
>>>> This would give us a forward-compatible way to add new semantics later
>>>> without breaking the ABI and needing a v2. We could just require it to be
>>>> 0 for now.
> 
> That makes sense. However, as Lorenzo mentioned previously, we should
> keep the interface as minimal as possible.

Got it.

[...]
>> Forgot to add:
>>
>> Noticed that if the hook returns 0, bpf_hook_thp_get_orders() falls
>> back to 'orders', preventing us from dynamically disabling mTHP
>> allocations.
> 
> Could you please clarify what you mean by that?
> 
> +       thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
> +       if (thp_order < 0)
> +               goto out;
> 
> In my implementation, it only falls back to @orders if the return
> value is negative. If the return value is 0, it uses BIT(0):

My bad, I completely misread the code last night ...

I see now that returning 0 forces a base page (order-0).

> 
> +       if (thp_order <= highest_order(orders))
> +               thp_orders = BIT(thp_order);

Yes, this is exactly the behavior we need. It will allow us to dynamically
disable mTHP for low-priority containers when we need to, which is perfect
for our use case!

> 
>>
>> Honoring a return of 0 is critical for our use case, which is to
>> dynamically disable mTHP for low-priority containers when memory gets
>> low in mixed workloads.
>>
>> And then re-enable it for them when memory is back above the low
>> watermark.
> 
> Thank you for detailing your use case; that context is very helpful.

Cheers,
Lance



* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-10  2:44 ` [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot Yafang Shao
                     ` (2 preceding siblings ...)
  2025-09-10 17:27   ` kernel test robot
@ 2025-09-11 13:43   ` Lorenzo Stoakes
  2025-09-14  2:47     ` Yafang Shao
  3 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 13:43 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc, Lance Yang

On Wed, Sep 10, 2025 at 10:44:38AM +0800, Yafang Shao wrote:
> Since a task with MMF_DISABLE_THP_COMPLETELY cannot use THP, remove it from
> the khugepaged_mm_slot to stop khugepaged from processing it.
>
> After this change, the following semantic relationship always holds:
>
>   MMF_VM_HUGEPAGE is set     == task is in khugepaged mm_slot
>   MMF_VM_HUGEPAGE is not set == task is not in khugepaged mm_slot
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Lance Yang <ioworker0@gmail.com>

(Obviously on the basis of fixing the issue the bot reported.)

> ---
>  include/linux/khugepaged.h |  1 +
>  kernel/sys.c               |  6 ++++++
>  mm/khugepaged.c            | 19 +++++++++----------
>  3 files changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index eb1946a70cff..6cb9107f1006 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -19,6 +19,7 @@ extern void khugepaged_min_free_kbytes_update(void);
>  extern bool current_is_khugepaged(void);
>  extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  				   bool install_pmd);
> +bool hugepage_pmd_enabled(void);

Need to provide a !CONFIG_TRANSPARENT_HUGEPAGE version, or avoid invoking
this in contexts where CONFIG_TRANSPARENT_HUGEPAGE may not be set.

>
>  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
>  {
> diff --git a/kernel/sys.c b/kernel/sys.c
> index a46d9b75880b..a1c1e8007f2d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -8,6 +8,7 @@
>  #include <linux/export.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/khugepaged.h>
>  #include <linux/utsname.h>
>  #include <linux/mman.h>
>  #include <linux/reboot.h>
> @@ -2493,6 +2494,11 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
>  		mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
>  		mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
>  	}
> +
> +	if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
> +	    !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> +	    hugepage_pmd_enabled())
> +		__khugepaged_enter(mm);

Let's refactor this so it's not open-coded.

We can have:

void khugepaged_enter_mm(struct mm_struct *mm)
{
	if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
		return;
	if (mm_flags_test(MMF_VM_HUGEPAGE, mm))
		return;
	if (!hugepage_pmd_enabled())
		return;

	__khugepaged_enter(mm);
}

void khugepaged_enter_vma(struct vm_area_struct *vma,
			  vm_flags_t vm_flags)
{
	if (!thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
		return;

	khugepaged_enter_mm(vma->vm_mm);
}

Then just invoke khugepaged_enter_mm() here.


>  	mmap_write_unlock(current->mm);
>  	return 0;
>  }
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4ec324a4c1fe..88ac482fb3a0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -413,7 +413,7 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
>  		mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
>  }
>
> -static bool hugepage_pmd_enabled(void)
> +bool hugepage_pmd_enabled(void)
>  {
>  	/*
>  	 * We cover the anon, shmem and the file-backed case here; file-backed
> @@ -445,6 +445,7 @@ void __khugepaged_enter(struct mm_struct *mm)
>
>  	/* __khugepaged_exit() must not run from under us */
>  	VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> +	WARN_ON_ONCE(mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm));

Not sure why this needs to be a naked WARN_ON_ONCE()? Seems that'd be a
programmatic error, so VM_WARN_ON_ONCE() would be more appropriate?

Can also change the VM_BUG_ON_MM() to VM_WARN_ON_ONCE_MM() while we're here.

>  	if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
>  		return;
>
> @@ -472,7 +473,8 @@ void __khugepaged_enter(struct mm_struct *mm)
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
> -	if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> +	if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, vma->vm_mm) &&
> +	    !mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
>  	    hugepage_pmd_enabled()) {
>  		if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
>  			__khugepaged_enter(vma->vm_mm);

See above, we can refactor this.

> @@ -1451,16 +1453,13 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
>
>  	lockdep_assert_held(&khugepaged_mm_lock);
>
> -	if (hpage_collapse_test_exit(mm)) {
> +	if (hpage_collapse_test_exit_or_disable(mm)) {
>  		/* free mm_slot */
>  		hash_del(&slot->hash);
>  		list_del(&slot->mm_node);
>
> -		/*
> -		 * Not strictly needed because the mm exited already.
> -		 *
> -		 * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> -		 */
> +		/* If the mm is disabled, this flag must be cleared. */
> +		mm_flags_clear(MMF_VM_HUGEPAGE, mm);
>
>  		/* khugepaged_mm_lock actually not necessary for the below */
>  		mm_slot_free(mm_slot_cache, mm_slot);
> @@ -2507,9 +2506,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
>  	VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
>  	/*
>  	 * Release the current mm_slot if this mm is about to die, or
> -	 * if we scanned all vmas of this mm.
> +	 * if we scanned all vmas of this mm, or if this mm is disabled.
>  	 */
> -	if (hpage_collapse_test_exit(mm) || !vma) {
> +	if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
>  		/*
>  		 * Make sure that if mm_users is reaching zero while
>  		 * khugepaged runs here, khugepaged_exit will find

Seems reasonable, but makes me wonder if we actually always want to invoke
hpage_collapse_test_exit_or_disable()?

I guess the VM_BUG_ON() (though it should be a VM_WARN_ON_ONCE()) in
__khugepaged_enter() is a legit use, but the only other case is
retract_page_tables().

I wonder if we should change this also? Seems reasonable to.

> --
> 2.47.3
>


* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-11  2:28       ` Zi Yan
  2025-09-11  2:35         ` Yafang Shao
  2025-09-11  2:38         ` Lance Yang
@ 2025-09-11 13:47         ` Lorenzo Stoakes
  2025-09-14  2:48           ` Yafang Shao
  2 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 13:47 UTC (permalink / raw)
  To: Zi Yan
  Cc: Lance Yang, Yafang Shao, llvm, oe-kbuild-all, bpf, linux-mm,
	linux-doc, Lance Yang, akpm, gutierrez.asier, rientjes, andrii,
	david, baolin.wang, Liam.Howlett, ameryhung, ryan.roberts,
	usamaarif642, willy, corbet, npache, dev.jain, 21cnbao,
	shakeel.butt, ast, daniel, hannes, kernel test robot

On Wed, Sep 10, 2025 at 10:28:30PM -0400, Zi Yan wrote:
> On 10 Sep 2025, at 22:12, Lance Yang wrote:
>
> > Hi Yafang,
> >
> > On 2025/9/11 01:27, kernel test robot wrote:
> >> Hi Yafang,
> >>
> >> kernel test robot noticed the following build errors:
> >>
> >> [auto build test ERROR on akpm-mm/mm-everything]
> >>
> >> url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-thp-remove-disabled-task-from-khugepaged_mm_slot/20250910-144850
> >> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> >> patch link:    https://lore.kernel.org/r/20250910024447.64788-2-laoar.shao%40gmail.com
> >> patch subject: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
> >> config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/config)
> >> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> >> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/reproduce)
> >>
> >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >> the same patch/commit), kindly add following tags
> >> | Reported-by: kernel test robot <lkp@intel.com>
> >> | Closes: https://lore.kernel.org/oe-kbuild-all/202509110109.PSgSHb31-lkp@intel.com/
> >>
> >> All errors (new ones prefixed by >>):
> >>
> >>>> kernel/sys.c:2500:6: error: call to undeclared function 'hugepage_pmd_enabled'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> >>      2500 |             hugepage_pmd_enabled())
> >>           |             ^
> >>>> kernel/sys.c:2501:3: error: call to undeclared function '__khugepaged_enter'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> >>      2501 |                 __khugepaged_enter(mm);
> >>           |                 ^
> >>     2 errors generated.
> >
> > Oops, seems like hugepage_pmd_enabled() and __khugepaged_enter() are only
> > available when CONFIG_TRANSPARENT_HUGEPAGE is enabled ;)
> >
> >>
> >>
> >> vim +/hugepage_pmd_enabled +2500 kernel/sys.c
> >>
> >>    2471
> >>    2472	static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
> >>    2473					 unsigned long arg4, unsigned long arg5)
> >>    2474	{
> >>    2475		struct mm_struct *mm = current->mm;
> >>    2476
> >>    2477		if (arg4 || arg5)
> >>    2478			return -EINVAL;
> >>    2479
> >>    2480		/* Flags are only allowed when disabling. */
> >>    2481		if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
> >>    2482			return -EINVAL;
> >>    2483		if (mmap_write_lock_killable(current->mm))
> >>    2484			return -EINTR;
> >>    2485		if (thp_disable) {
> >>    2486			if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
> >>    2487				mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
> >>    2488				mm_flags_set(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> >>    2489			} else {
> >>    2490				mm_flags_set(MMF_DISABLE_THP_COMPLETELY, mm);
> >>    2491				mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> >>    2492			}
> >>    2493		} else {
> >>    2494			mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
> >>    2495			mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> >>    2496		}
> >>    2497
> >>    2498		if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
> >>    2499		    !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> >>> 2500		    hugepage_pmd_enabled())
> >>> 2501			__khugepaged_enter(mm);
> >>    2502		mmap_write_unlock(current->mm);
> >>    2503		return 0;
> >>    2504	}
> >>    2505
> >
> > So, let's wrap the new logic in an #ifdef CONFIG_TRANSPARENT_HUGEPAGE block.
> >
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index a1c1e8007f2d..c8600e017933 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2495,10 +2495,13 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
> >                 mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> >         }
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >         if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
> >             !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> >             hugepage_pmd_enabled())
> >                 __khugepaged_enter(mm);
> > +#endif
> > +
> >         mmap_write_unlock(current->mm);
> >         return 0;
> >  }
>
> Or in the header file,
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> ...
> #else
> bool hugepage_pmd_enabled()
> {
> 	return false;
> }
>
> int __khugepaged_enter(struct mm_struct *mm)
> {
> 	return 0;
> }

It seems we have a convention of just not implementing things here unless
they're necessarily used in core code paths (and _with my suggested change_
it's _just_ khugepaged that's invoking them).

Anyway with my suggestion we can fix this entirely with:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE

void khugepaged_enter_mm(struct mm_struct *mm);

#else

static inline void khugepaged_enter_mm(struct mm_struct *mm)
{
}

#endif


Cheers, Lorenzo

> #endif
>
> Best Regards,
> Yan, Zi


* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10 12:42   ` Lance Yang
  2025-09-10 12:54     ` Lance Yang
@ 2025-09-11 14:02     ` Lorenzo Stoakes
  2025-09-11 14:42       ` Lance Yang
  1 sibling, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 14:02 UTC (permalink / raw)
  To: Lance Yang
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 08:42:37PM +0800, Lance Yang wrote:
> Hey Yafang,
>
> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> >   For example, workloads running in specific containers or cgroups.
> > - Allocation context
> >   Whether the allocation occurs during a page fault, khugepaged, swap or
> >   other paths.
> > - VMA's memory advice settings
> >   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > - Memory pressure
> >   PSI system data or associated cgroup PSI metrics
> >
> > The kernel API of this new BPF hook is as follows,
> >
> > /**
> >  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >  * @vma: vm_area_struct associated with the THP allocation
> >  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >  *            neither is set.
> >  * @tva_type: TVA type for current @vma
> >  * @orders: Bitmask of requested THP orders for this allocation
> >  *          - PMD-mapped allocation if PMD_ORDER is set
> >  *          - mTHP allocation otherwise
> >  *
> >  * Return: The suggested THP order from the BPF program for allocation. It will
> >  *         not exceed the highest requested order in @orders. Return -1 to
> >  *         indicate that the original requested @orders should remain unchanged.
> >  */
> > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >                            enum bpf_thp_vma_type vma_type,
> >                            enum tva_type tva_type,
> >                            unsigned long orders);
> >
> > Only a single BPF program can be attached at any given time, though it can
> > be dynamically updated to adjust the policy. The implementation supports
> > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > file-backed THP.
> >
> > This functionality is only active when system-wide THP is configured to
> > madvise or always mode. It remains disabled in never mode. Additionally,
> > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > functionality will also be unavailable for that task.
> >
> > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > enabled. Note that this capability is currently unstable and may undergo
> > significant changes—including potential removal—in future kernel versions.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> [...]
> > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > new file mode 100644
> > index 000000000000..525ee22ab598
> > --- /dev/null
> > +++ b/mm/huge_memory_bpf.c
> > @@ -0,0 +1,243 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * BPF-based THP policy management
> > + *
> > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +enum bpf_thp_vma_type {
> > +       BPF_THP_VM_NONE = 0,
> > +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > +};
> > +
> > +/**
> > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > + * @vma: vm_area_struct associated with the THP allocation
> > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > + *            neither is set.
> > + * @tva_type: TVA type for current @vma
> > + * @orders: Bitmask of requested THP orders for this allocation
> > + *          - PMD-mapped allocation if PMD_ORDER is set
> > + *          - mTHP allocation otherwise
> > + *
> > + * Return: The suggested THP order from the BPF program for allocation. It will
> > + *         not exceed the highest requested order in @orders. Return -1 to
> > + *         indicate that the original requested @orders should remain unchanged.
>
> A minor documentation nit: the comment says "Return -1 to indicate that the
> original requested @orders should remain unchanged". It might be slightly
> clearer to say "Return a negative value to fall back to the original
> behavior". This would cover all error codes as well ;)
>
> > + */
> > +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > +                          enum bpf_thp_vma_type vma_type,
> > +                          enum tva_type tva_type,
> > +                          unsigned long orders);
>
> Sorry if I'm missing some context here since I haven't tracked the whole
> series closely.
>
> Regarding the return value for thp_order_fn_t: right now it returns a
> single int order. I was thinking, what if we let it return an unsigned
> long bitmask of orders instead? This seems like it would be more flexible
> down the road, especially if we get more mTHP sizes to choose from. It
> would also make the API more consistent, as bpf_hook_thp_get_orders()
> itself returns an unsigned long ;)

I think that adds confusion - as in how an order might be chosen from
those. Also we have _received_ a bitmap of available orders - and the intent
here is to select _which one we should use_.

And this is an experimental feature, behind a flag explicitly labelled as
experimental (and thus subject to change) so if we found we needed to change
things in the future we can.

>
> Also, for future extensions, it might be a good idea to add a reserved
> flags argument to the thp_order_fn_t signature.

We don't need to do anything like this, as we are behind an experimental flag
and in no way guarantee that this will be used this way going forwards.
>
> For example thp_order_fn_t(..., unsigned long flags).
>
> This would give us a forward-compatible way to add new semantics later
> without breaking the ABI and needing a v2. We could just require it to be
> 0 for now.

There is no ABI.

I mean again to emphasise, this is an _experimental_ feature not to be relied
upon in production.

>
> Thanks for the great work!
> Lance

Perhaps we need to put a 'EXPERIMENTAL_' prefix on the config flag too to really
bring this home, as it's perhaps not all that clear :)

Cheers, Lorenzo


* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10  2:44 ` [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection Yafang Shao
  2025-09-10 12:42   ` Lance Yang
@ 2025-09-11 14:33   ` Lorenzo Stoakes
  2025-09-12  8:28     ` Yafang Shao
  2025-09-11 14:51   ` Lorenzo Stoakes
  2025-09-25 10:05   ` Lance Yang
  3 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 14:33 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 10:44:39AM +0800, Yafang Shao wrote:
> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> programs to influence THP order selection based on factors such as:
> - Workload identity
>   For example, workloads running in specific containers or cgroups.
> - Allocation context
>   Whether the allocation occurs during a page fault, khugepaged, swap or
>   other paths.
> - VMA's memory advice settings
>   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> - Memory pressure
>   PSI system data or associated cgroup PSI metrics
>
> The kernel API of this new BPF hook is as follows,
>
> /**
>  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>  * @vma: vm_area_struct associated with the THP allocation
>  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
>  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
>  *            neither is set.
>  * @tva_type: TVA type for current @vma
>  * @orders: Bitmask of requested THP orders for this allocation
>  *          - PMD-mapped allocation if PMD_ORDER is set
>  *          - mTHP allocation otherwise
>  *
>  * Return: The suggested THP order from the BPF program for allocation. It will
>  *         not exceed the highest requested order in @orders. Return -1 to
>  *         indicate that the original requested @orders should remain unchanged.
>  */
> typedef int thp_order_fn_t(struct vm_area_struct *vma,
> 			   enum bpf_thp_vma_type vma_type,
> 			   enum tva_type tva_type,
> 			   unsigned long orders);
>
> Only a single BPF program can be attached at any given time, though it can
> be dynamically updated to adjust the policy. The implementation supports
> anonymous THP, shmem THP, and mTHP, with future extensions planned for
> file-backed THP.
>
> This functionality is only active when system-wide THP is configured to
> madvise or always mode. It remains disabled in never mode. Additionally,
> if THP is explicitly disabled for a specific task via prctl(), this BPF
> functionality will also be unavailable for that task.
>
> This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> enabled. Note that this capability is currently unstable and may undergo
> significant changes—including potential removal—in future kernel versions.

Thanks for highlighting.

>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  MAINTAINERS             |   1 +
>  include/linux/huge_mm.h |  26 ++++-
>  mm/Kconfig              |  12 ++
>  mm/Makefile             |   1 +
>  mm/huge_memory_bpf.c    | 243 ++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 280 insertions(+), 3 deletions(-)
>  create mode 100644 mm/huge_memory_bpf.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8fef05bc2224..d055a3c95300 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16252,6 +16252,7 @@ F:	include/linux/huge_mm.h
>  F:	include/linux/khugepaged.h
>  F:	include/trace/events/huge_memory.h
>  F:	mm/huge_memory.c
> +F:	mm/huge_memory_bpf.c

Thanks!

>  F:	mm/khugepaged.c
>  F:	mm/mm_slot.h
>  F:	tools/testing/selftests/mm/khugepaged.c
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 23f124493c47..f72a5fd04e4f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
>  	TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>  	TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>  	TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> +	TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
>  };
>
>  struct kobject;
> @@ -270,6 +271,19 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 enum tva_type type,
>  					 unsigned long orders);
>
> +#ifdef CONFIG_BPF_GET_THP_ORDER
> +unsigned long
> +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> +			enum tva_type type, unsigned long orders);

Thanks for renaming!

> +#else
> +static inline unsigned long
> +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> +			enum tva_type tva_flags, unsigned long orders)
> +{
> +	return orders;
> +}
> +#endif
> +
>  /**
>   * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>   * @vma:  the vm area to check
> @@ -291,6 +305,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  				       enum tva_type type,
>  				       unsigned long orders)
>  {
> +	unsigned long bpf_orders;
> +
> +	bpf_orders = bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
> +	if (!bpf_orders)
> +		return 0;

I think it'd be easier to just do:

	/* The BPF-specified order overrides which order is selected. */
	orders &= bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
	if (!orders)
		return 0;

> +
>  	/*
>  	 * Optimization to check if required orders are enabled early. Only
>  	 * forced collapse ignores sysfs configs.
> @@ -304,12 +324,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>  		    ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>  			mask |= READ_ONCE(huge_anon_orders_inherit);
>
> -		orders &= mask;
> -		if (!orders)
> +		bpf_orders &= mask;
> +		if (!bpf_orders)
>  			return 0;

With my suggested change this would remain the same.

>  	}
>
> -	return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
> +	return __thp_vma_allowable_orders(vma, vm_flags, type, bpf_orders);

With my suggested change this would remain the same.

>  }
>
>  struct thpsize {
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d1ed839ca710..4d89d2158f10 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -896,6 +896,18 @@ config NO_PAGE_MAPCOUNT
>
>  	  EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config BPF_GET_THP_ORDER

Yeah, I think we maybe need to sledgehammer this, as Lance was already confused
as to the permanence of this, and I feel that users might be too, even with the
'(EXPERIMENTAL)' bit.

So maybe

config BPF_GET_THP_ORDER_EXPERIMENTAL

Just to hammer it home?

> +	bool "BPF-based THP order selection (EXPERIMENTAL)"
> +	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> +
> +	help
> +	  Enable dynamic THP order selection using BPF programs. This
> +	  experimental feature allows custom BPF logic to determine optimal
> +	  transparent hugepage allocation sizes at runtime.
> +
> +	  WARNING: This feature is unstable and may change in future kernel
> +	  versions.
> +
>  endif # TRANSPARENT_HUGEPAGE
>
>  # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index 21abb3353550..f180332f2ad0 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +obj-$(CONFIG_BPF_GET_THP_ORDER) += huge_memory_bpf.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> new file mode 100644
> index 000000000000..525ee22ab598
> --- /dev/null
> +++ b/mm/huge_memory_bpf.c
> @@ -0,0 +1,243 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * BPF-based THP policy management
> + *
> + * Author: Yafang Shao <laoar.shao@gmail.com>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +enum bpf_thp_vma_type {
> +	BPF_THP_VM_NONE = 0,
> +	BPF_THP_VM_HUGEPAGE,	/* VM_HUGEPAGE */
> +	BPF_THP_VM_NOHUGEPAGE,	/* VM_NOHUGEPAGE */
> +};

I'm really not so sure how useful this is - can't a user just ascertain this
from the VMA flags themselves?

Let's keep the interface as minimal as possible.

> +
> +/**
> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation

orders -> order?

> + * @vma: vm_area_struct associated with the THP allocation
> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> + *            neither is set.

Obv as above let's drop this probably :)

> + * @tva_type: TVA type for current @vma
> + * @orders: Bitmask of requested THP orders for this allocation

Shouldn't requested = available?

> + *          - PMD-mapped allocation if PMD_ORDER is set
> + *          - mTHP allocation otherwise

Not sure these 2 points are super useful.

> + *
> + * Return: The suggested THP order from the BPF program for allocation. It will
> + *         not exceed the highest requested order in @orders. Return -1 to
> + *         indicate that the original requested @orders should remain unchanged.
> + */
> +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> +			   enum bpf_thp_vma_type vma_type,
> +			   enum tva_type tva_type,
> +			   unsigned long orders);
> +
> +struct bpf_thp_ops {
> +	thp_order_fn_t __rcu *thp_get_order;
> +};
> +
> +static struct bpf_thp_ops bpf_thp;
> +static DEFINE_SPINLOCK(thp_ops_lock);
> +
> +/*
> + * Returns the original @orders if no BPF program is attached or if the
> + * suggested order is invalid.
> + */
> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> +				      vm_flags_t vma_flags,
> +				      enum tva_type tva_type,
> +				      unsigned long orders)
> +{
> +	thp_order_fn_t *bpf_hook_thp_get_order;
> +	unsigned long thp_orders = orders;
> +	enum bpf_thp_vma_type vma_type;
> +	int thp_order;
> +
> +	/* No BPF program is attached */
> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +		      &transparent_hugepage_flags))
> +		return orders;
> +
> +	if (vma_flags & VM_HUGEPAGE)
> +		vma_type = BPF_THP_VM_HUGEPAGE;
> +	else if (vma_flags & VM_NOHUGEPAGE)
> +		vma_type = BPF_THP_VM_NOHUGEPAGE;
> +	else
> +		vma_type = BPF_THP_VM_NONE;

As per above, not sure this is all that useful.

> +
> +	rcu_read_lock();
> +	bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> +	if (!bpf_hook_thp_get_order)
> +		goto out;
> +
> +	thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
> +	if (thp_order < 0)
> +		goto out;
> +	/*
> +	 * The maximum requested order is determined by the callsite. E.g.:
> +	 * - PMD-mapped THP uses PMD_ORDER
> +	 * - mTHP uses (PMD_ORDER - 1)

I don't think this is quite right, highest_order() figures out the highest set
bit, so mTHP can be PMD_ORDER - 1 or less (in theory ofc).

I think we can just replace this with something simpler like - 'depending on
where the BPF hook is invoked, we check for either PMD order or mTHP orders
(less than PMD order)' or something.

> +	 *
> +	 * We must respect this upper bound to avoid undefined behavior. So the
> +	 * highest suggested order can't exceed the highest requested order.
> +	 */

I think this sentence is also unnecessary.

> +	if (thp_order <= highest_order(orders))
> +		thp_orders = BIT(thp_order);
> +
> +out:
> +	rcu_read_unlock();
> +	return thp_orders;
> +}
> +
> +static bool bpf_thp_ops_is_valid_access(int off, int size,
> +					enum bpf_access_type type,
> +					const struct bpf_prog *prog,
> +					struct bpf_insn_access_aux *info)
> +{
> +	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	return bpf_base_func_proto(func_id, prog);
> +}
> +
> +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> +	.get_func_proto = bpf_thp_get_func_proto,
> +	.is_valid_access = bpf_thp_ops_is_valid_access,
> +};
> +
> +static int bpf_thp_init(struct btf *btf)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_check_member(const struct btf_type *t,
> +				const struct btf_member *member,
> +				const struct bpf_prog *prog)
> +{
> +	/* The call site operates under RCU protection. */
> +	if (prog->sleepable)
> +		return -EINVAL;
> +	return 0;
> +}
> +
> +static int bpf_thp_init_member(const struct btf_type *t,
> +			       const struct btf_member *member,
> +			       void *kdata, const void *udata)
> +{
> +	return 0;
> +}
> +
> +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +
> +	spin_lock(&thp_ops_lock);
> +	if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +			     &transparent_hugepage_flags)) {
> +		spin_unlock(&thp_ops_lock);
> +		return -EBUSY;
> +	}
> +	WARN_ON_ONCE(rcu_access_pointer(bpf_thp.thp_get_order));
> +	rcu_assign_pointer(bpf_thp.thp_get_order, ops->thp_get_order);
> +	spin_unlock(&thp_ops_lock);
> +	return 0;
> +}
> +
> +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> +{
> +	thp_order_fn_t *old_fn;
> +
> +	spin_lock(&thp_ops_lock);
> +	clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> +	old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, NULL,
> +				     lockdep_is_held(&thp_ops_lock));
> +	WARN_ON_ONCE(!old_fn);
> +	spin_unlock(&thp_ops_lock);
> +
> +	synchronize_rcu();
> +}
> +
> +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> +{
> +	thp_order_fn_t *old_fn, *new_fn;
> +	struct bpf_thp_ops *old = old_kdata;
> +	struct bpf_thp_ops *ops = kdata;
> +	int ret = 0;
> +
> +	if (!ops || !old)
> +		return -EINVAL;
> +
> +	spin_lock(&thp_ops_lock);
> +	/* The prog has already been removed. */
> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +		      &transparent_hugepage_flags)) {
> +		ret = -ENOENT;
> +		goto out;
> +	}
> +
> +	new_fn = rcu_dereference(ops->thp_get_order);
> +	old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, new_fn,
> +				     lockdep_is_held(&thp_ops_lock));
> +	WARN_ON_ONCE(!old_fn || !new_fn);
> +
> +out:
> +	spin_unlock(&thp_ops_lock);
> +	if (!ret)
> +		synchronize_rcu();
> +	return ret;
> +}
> +
> +static int bpf_thp_validate(void *kdata)
> +{
> +	struct bpf_thp_ops *ops = kdata;
> +
> +	if (!ops->thp_get_order) {
> +		pr_err("bpf_thp: required ops isn't implemented\n");
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static int bpf_thp_get_order(struct vm_area_struct *vma,
> +			     enum bpf_thp_vma_type vma_type,
> +			     enum tva_type tva_type,
> +			     unsigned long orders)
> +{
> +	return -1;
> +}
> +
> +static struct bpf_thp_ops __bpf_thp_ops = {
> +	.thp_get_order = (thp_order_fn_t __rcu *)bpf_thp_get_order,
> +};
> +
> +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> +	.verifier_ops = &thp_bpf_verifier_ops,
> +	.init = bpf_thp_init,
> +	.check_member = bpf_thp_check_member,
> +	.init_member = bpf_thp_init_member,
> +	.reg = bpf_thp_reg,
> +	.unreg = bpf_thp_unreg,
> +	.update = bpf_thp_update,
> +	.validate = bpf_thp_validate,
> +	.cfi_stubs = &__bpf_thp_ops,
> +	.owner = THIS_MODULE,
> +	.name = "bpf_thp_ops",
> +};
> +
> +static int __init bpf_thp_ops_init(void)
> +{
> +	int err;
> +
> +	err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> +	if (err)
> +		pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> +	return err;
> +}
> +late_initcall(bpf_thp_ops_init);
> --
> 2.47.3
>


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-11 14:02     ` Lorenzo Stoakes
@ 2025-09-11 14:42       ` Lance Yang
  2025-09-11 14:58         ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Lance Yang @ 2025-09-11 14:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc



On 2025/9/11 22:02, Lorenzo Stoakes wrote:
> On Wed, Sep 10, 2025 at 08:42:37PM +0800, Lance Yang wrote:
>> Hey Yafang,
>>
>> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>
>>> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
>>> THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
>>> programs to influence THP order selection based on factors such as:
>>> - Workload identity
>>>    For example, workloads running in specific containers or cgroups.
>>> - Allocation context
>>>    Whether the allocation occurs during a page fault, khugepaged, swap or
>>>    other paths.
>>> - VMA's memory advice settings
>>>    MADV_HUGEPAGE or MADV_NOHUGEPAGE
>>> - Memory pressure
>>>    PSI system data or associated cgroup PSI metrics
>>>
>>> The kernel API of this new BPF hook is as follows,
>>>
>>> /**
>>>   * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>>>   * @vma: vm_area_struct associated with the THP allocation
>>>   * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
>>>   *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
>>>   *            neither is set.
>>>   * @tva_type: TVA type for current @vma
>>>   * @orders: Bitmask of requested THP orders for this allocation
>>>   *          - PMD-mapped allocation if PMD_ORDER is set
>>>   *          - mTHP allocation otherwise
>>>   *
>>>   * Return: The suggested THP order from the BPF program for allocation. It will
>>>   *         not exceed the highest requested order in @orders. Return -1 to
>>>   *         indicate that the original requested @orders should remain unchanged.
>>>   */
>>> typedef int thp_order_fn_t(struct vm_area_struct *vma,
>>>                             enum bpf_thp_vma_type vma_type,
>>>                             enum tva_type tva_type,
>>>                             unsigned long orders);
>>>
>>> Only a single BPF program can be attached at any given time, though it can
>>> be dynamically updated to adjust the policy. The implementation supports
>>> anonymous THP, shmem THP, and mTHP, with future extensions planned for
>>> file-backed THP.
>>>
>>> This functionality is only active when system-wide THP is configured to
>>> madvise or always mode. It remains disabled in never mode. Additionally,
>>> if THP is explicitly disabled for a specific task via prctl(), this BPF
>>> functionality will also be unavailable for that task.
>>>
>>> This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
>>> enabled. Note that this capability is currently unstable and may undergo
>>> significant changes—including potential removal—in future kernel versions.
>>>
>>> Suggested-by: David Hildenbrand <david@redhat.com>
>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>>> ---
>> [...]
>>> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
>>> new file mode 100644
>>> index 000000000000..525ee22ab598
>>> --- /dev/null
>>> +++ b/mm/huge_memory_bpf.c
>>> @@ -0,0 +1,243 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * BPF-based THP policy management
>>> + *
>>> + * Author: Yafang Shao <laoar.shao@gmail.com>
>>> + */
>>> +
>>> +#include <linux/bpf.h>
>>> +#include <linux/btf.h>
>>> +#include <linux/huge_mm.h>
>>> +#include <linux/khugepaged.h>
>>> +
>>> +enum bpf_thp_vma_type {
>>> +       BPF_THP_VM_NONE = 0,
>>> +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
>>> +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
>>> +};
>>> +
>>> +/**
>>> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>>> + * @vma: vm_area_struct associated with the THP allocation
>>> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
>>> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
>>> + *            neither is set.
>>> + * @tva_type: TVA type for current @vma
>>> + * @orders: Bitmask of requested THP orders for this allocation
>>> + *          - PMD-mapped allocation if PMD_ORDER is set
>>> + *          - mTHP allocation otherwise
>>> + *
>>> + * Return: The suggested THP order from the BPF program for allocation. It will
>>> + *         not exceed the highest requested order in @orders. Return -1 to
>>> + *         indicate that the original requested @orders should remain unchanged.
>>
>> A minor documentation nit: the comment says "Return -1 to indicate that the
>> original requested @orders should remain unchanged". It might be slightly
>> clearer to say "Return a negative value to fall back to the original
>> behavior". This would cover all error codes as well ;)
>>
>>> + */
>>> +typedef int thp_order_fn_t(struct vm_area_struct *vma,
>>> +                          enum bpf_thp_vma_type vma_type,
>>> +                          enum tva_type tva_type,
>>> +                          unsigned long orders);
>>
>> Sorry if I'm missing some context here since I haven't tracked the whole
>> series closely.
>>
>> Regarding the return value for thp_order_fn_t: right now it returns a
>> single int order. I was thinking, what if we let it return an unsigned
>> long bitmask of orders instead? This seems like it would be more flexible
>> down the road, especially if we get more mTHP sizes to choose from. It
>> would also make the API more consistent, as bpf_hook_thp_get_orders()
>> itself returns an unsigned long ;)
> 
> I think that adds confusion - as in how an order might be chosen from
> those. Also we have _received_ a bitmap of available orders - and the intent
> here is to select _which one we should use_.

Yep. Makes sense to me ;)

> 
> And this is an experimental feature, behind a flag explicitly labelled as
> experimental (and thus subject to change) so if we found we needed to change
> things in the future we can.

You're right, I didn't pay enough attention to the fact that this is
an experimental feature. So my suggestions were based on a lack of
context ...

> 
>>
>> Also, for future extensions, it might be a good idea to add a reserved
>> flags argument to the thp_order_fn_t signature.
> 
> We don't need to do anything like this, as we are behind an experimental flag
> and in no way guarantee that this will be used this way going forwards.
>>
>> For example thp_order_fn_t(..., unsigned long flags).
>>
>> This would give us a forward-compatible way to add new semantics later
>> without breaking the ABI and needing a v2. We could just require it to be
>> 0 for now.
> 
> There is no ABI.
> 
> I mean again to emphasise, this is an _experimental_ feature not to be relied
> upon in production.
> 
>>
>> Thanks for the great work!
>> Lance
> 
> Perhaps we need to put a 'EXPERIMENTAL_' prefix on the config flag too to really
> bring this home, as it's perhaps not all that clear :)

No need for a 'EXPERIMENTAL_' prefix, it was just me missing
the background. Appreciate you clarifying this!

Cheers,
Lance




* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10 13:56       ` Lance Yang
  2025-09-11  2:48         ` Yafang Shao
@ 2025-09-11 14:45         ` Lorenzo Stoakes
  1 sibling, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 14:45 UTC (permalink / raw)
  To: Lance Yang
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 09:56:47PM +0800, Lance Yang wrote:
>
>
> On 2025/9/10 20:54, Lance Yang wrote:
> > On Wed, Sep 10, 2025 at 8:42 PM Lance Yang <lance.yang@linux.dev> wrote:
> > >
> > > Hey Yafang,
> > >
> > > On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > > > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > > > programs to influence THP order selection based on factors such as:
> > > > - Workload identity
> > > >    For example, workloads running in specific containers or cgroups.
> > > > - Allocation context
> > > >    Whether the allocation occurs during a page fault, khugepaged, swap or
> > > >    other paths.
> > > > - VMA's memory advice settings
> > > >    MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > > > - Memory pressure
> > > >    PSI system data or associated cgroup PSI metrics
> > > >
> > > > The kernel API of this new BPF hook is as follows,
> > > >
> > > > /**
> > > >   * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > > >   * @vma: vm_area_struct associated with the THP allocation
> > > >   * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > > >   *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > > >   *            neither is set.
> > > >   * @tva_type: TVA type for current @vma
> > > >   * @orders: Bitmask of requested THP orders for this allocation
> > > >   *          - PMD-mapped allocation if PMD_ORDER is set
> > > >   *          - mTHP allocation otherwise
> > > >   *
> > > >   * Return: The suggested THP order from the BPF program for allocation. It will
> > > >   *         not exceed the highest requested order in @orders. Return -1 to
> > > >   *         indicate that the original requested @orders should remain unchanged.
> > > >   */
> > > > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > > >                             enum bpf_thp_vma_type vma_type,
> > > >                             enum tva_type tva_type,
> > > >                             unsigned long orders);
> > > >
> > > > Only a single BPF program can be attached at any given time, though it can
> > > > be dynamically updated to adjust the policy. The implementation supports
> > > > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > > > file-backed THP.
> > > >
> > > > This functionality is only active when system-wide THP is configured to
> > > > madvise or always mode. It remains disabled in never mode. Additionally,
> > > > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > > > functionality will also be unavailable for that task.
> > > >
> > > > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > > > enabled. Note that this capability is currently unstable and may undergo
> > > > significant changes—including potential removal—in future kernel versions.
> > > >
> > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > ---
> > > [...]
> > > > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > > > new file mode 100644
> > > > index 000000000000..525ee22ab598
> > > > --- /dev/null
> > > > +++ b/mm/huge_memory_bpf.c
> > > > @@ -0,0 +1,243 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * BPF-based THP policy management
> > > > + *
> > > > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > > > + */
> > > > +
> > > > +#include <linux/bpf.h>
> > > > +#include <linux/btf.h>
> > > > +#include <linux/huge_mm.h>
> > > > +#include <linux/khugepaged.h>
> > > > +
> > > > +enum bpf_thp_vma_type {
> > > > +       BPF_THP_VM_NONE = 0,
> > > > +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > > > +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > > > +};
> > > > +
> > > > +/**
> > > > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > > > + * @vma: vm_area_struct associated with the THP allocation
> > > > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > > > + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > > > + *            neither is set.
> > > > + * @tva_type: TVA type for current @vma
> > > > + * @orders: Bitmask of requested THP orders for this allocation
> > > > + *          - PMD-mapped allocation if PMD_ORDER is set
> > > > + *          - mTHP allocation otherwise
> > > > + *
> > > > + * Return: The suggested THP order from the BPF program for allocation. It will
> > > > + *         not exceed the highest requested order in @orders. Return -1 to
> > > > + *         indicate that the original requested @orders should remain unchanged.
> > >
> > > A minor documentation nit: the comment says "Return -1 to indicate that the
> > > original requested @orders should remain unchanged". It might be slightly
> > > clearer to say "Return a negative value to fall back to the original
> > > behavior". This would cover all error codes as well ;)
> > >
> > > > + */
> > > > +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > > > +                          enum bpf_thp_vma_type vma_type,
> > > > +                          enum tva_type tva_type,
> > > > +                          unsigned long orders);
> > >
> > > Sorry if I'm missing some context here since I haven't tracked the whole
> > > series closely.
> > >
> > > Regarding the return value for thp_order_fn_t: right now it returns a
> > > single int order. I was thinking, what if we let it return an unsigned
> > > long bitmask of orders instead? This seems like it would be more flexible
> > > down the road, especially if we get more mTHP sizes to choose from. It
> > > would also make the API more consistent, as bpf_hook_thp_get_orders()
> > > itself returns an unsigned long ;)
> >
> > I just realized a flaw in my previous suggestion :(
> >
> > Changing the return type of thp_order_fn_t to unsigned long for consistency
> > and flexibility. However, I completely overlooked that this would prevent
> > the BPF program from returning negative error codes ...
> >
> > Thanks,
> > Lance
> >
> > >
> > > Also, for future extensions, it might be a good idea to add a reserved
> > > flags argument to the thp_order_fn_t signature.
> > >
> > > For example thp_order_fn_t(..., unsigned long flags).
> > >
> > > This would give us a forward-compatible way to add new semantics later
> > > without breaking the ABI and needing a v2. We could just require it to be
> > > 0 for now.
> > >
> > > Thanks for the great work!
> > > Lance
>
>
> Forgot to add:
>
> Noticed that if the hook returns 0, bpf_hook_thp_get_orders() falls
> back to 'orders', preventing us from dynamically disabling mTHP
> allocations.
>
> Honoring a return of 0 is critical for our use case, which is to
> dynamically disable mTHP for low-priority containers when memory gets
> low in mixed workloads.

Right, yeah we shouldn't just default back to orders, should we. The user _knows_
what orders are available and should select one.

Actually that logic is a bit weird overall let me reply again to patch...

>
> And then re-enable it for them when memory is back above the low
> watermark.
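To make the semantics discussed above concrete, here is a minimal userspace sketch (hypothetical function names, not the proposed kernel API): a return of 0 from the hook is honoured as "no THP at all", a negative return falls back to the caller's @orders, and any other value is masked against the offered orders. The `low_priority`/`under_pressure` inputs stand in for whatever container-priority and PSI signals a real BPF program would consult.

```c
#include <assert.h>
#include <stdbool.h>

#define BIT(n) (1UL << (n))
#define PMD_ORDER 9 /* assumed value: x86-64 with 4K pages */

/*
 * Hypothetical sketch of the policy Lance describes: suppress (m)THP for
 * low-priority work while memory is below the low watermark, otherwise
 * decline to decide and keep the caller's requested orders.
 */
static int thp_get_order_policy(bool low_priority, bool under_pressure,
				unsigned long orders)
{
	if (low_priority && under_pressure)
		return 0;	/* order 0 == base pages only, i.e. no THP */
	return -1;		/* negative: leave @orders unchanged */
}

/* Caller side: honour a return of 0 instead of falling back to @orders. */
static unsigned long apply_policy(int suggested, unsigned long orders)
{
	if (suggested < 0)
		return orders;			/* BPF declined to decide */
	return orders & BIT(suggested);		/* empty mask means no THP */
}
```

Since @orders never includes order 0, `apply_policy(0, orders)` yields an empty mask, which is exactly the "dynamically disable mTHP" behaviour; lifting the pressure condition makes the hook return -1 again and re-enables the requested orders.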



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10  2:44 ` [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection Yafang Shao
  2025-09-10 12:42   ` Lance Yang
  2025-09-11 14:33   ` Lorenzo Stoakes
@ 2025-09-11 14:51   ` Lorenzo Stoakes
  2025-09-12  8:03     ` Yafang Shao
  2025-09-25 10:05   ` Lance Yang
  3 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 14:51 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 10:44:39AM +0800, Yafang Shao wrote:
> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> new file mode 100644
> index 000000000000..525ee22ab598
> --- /dev/null
> +++ b/mm/huge_memory_bpf.c

[snip]

> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> +				      vm_flags_t vma_flags,
> +				      enum tva_type tva_type,
> +				      unsigned long orders)
> +{
> +	thp_order_fn_t *bpf_hook_thp_get_order;
> +	unsigned long thp_orders = orders;
> +	enum bpf_thp_vma_type vma_type;
> +	int thp_order;
> +
> +	/* No BPF program is attached */
> +	if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +		      &transparent_hugepage_flags))
> +		return orders;
> +
> +	if (vma_flags & VM_HUGEPAGE)
> +		vma_type = BPF_THP_VM_HUGEPAGE;
> +	else if (vma_flags & VM_NOHUGEPAGE)
> +		vma_type = BPF_THP_VM_NOHUGEPAGE;
> +	else
> +		vma_type = BPF_THP_VM_NONE;
> +
> +	rcu_read_lock();
> +	bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> +	if (!bpf_hook_thp_get_order)
> +		goto out;
> +
> +	thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
> +	if (thp_order < 0)
> +		goto out;
> +	/*
> +	 * The maximum requested order is determined by the callsite. E.g.:
> +	 * - PMD-mapped THP uses PMD_ORDER
> +	 * - mTHP uses (PMD_ORDER - 1)
> +	 *
> +	 * We must respect this upper bound to avoid undefined behavior. So the
> +	 * highest suggested order can't exceed the highest requested order.
> +	 */
> +	if (thp_order <= highest_order(orders))
> +		thp_orders = BIT(thp_order);

OK so looking at Lance's reply re: setting 0 and what we're doing here in
general - this seems a bit weird to me.

Shouldn't orders be specifying a _mask_ as to which orders are _available_,
rather than allowing a user to specify an arbitrary order?

So if you're in a position where the only possible order is PMD sized, now this
would let you arbitrarily select an mTHP, right? That does not seem correct.

And as per Lance, if we cannot satisfy the requested order, we shouldn't fall
back to available orders, we should take that as a signal that we cannot have
THP at all.

So shouldn't this just be:

	thp_orders = orders & BIT(thp_order);

? Or am I missing something here?

> +
> +out:
> +	rcu_read_unlock();
> +	return thp_orders;
> +}

Cheers, Lorenzo



* Re: [PATCH v7 mm-new 03/10] mm: thp: decouple THP allocation between swap and page fault paths
  2025-09-10  2:44 ` [PATCH v7 mm-new 03/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
@ 2025-09-11 14:55   ` Lorenzo Stoakes
  2025-09-12  7:20     ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 14:55 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 10:44:40AM +0800, Yafang Shao wrote:
> The new BPF capability enables finer-grained THP policy decisions by
> introducing separate handling for swap faults versus normal page faults.
>
> As highlighted by Barry:
>
>   We’ve observed that swapping in large folios can lead to more
>   swap thrashing for some workloads- e.g. kernel build. Consequently,
>   some workloads might prefer swapping in smaller folios than those
>   allocated by alloc_anon_folio().
>
> While prctl() could potentially be extended to leverage this new policy,
> doing so would require modifications to the uAPI.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Other than nits, this seems fine, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> Cc: Barry Song <21cnbao@gmail.com>
> ---
>  include/linux/huge_mm.h | 3 ++-
>  mm/huge_memory.c        | 2 +-
>  mm/memory.c             | 2 +-
>  3 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f72a5fd04e4f..b9742453806f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -97,9 +97,10 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>
>  enum tva_type {
>  	TVA_SMAPS,		/* Exposing "THPeligible:" in smaps. */
> -	TVA_PAGEFAULT,		/* Serving a page fault. */
> +	TVA_PAGEFAULT,		/* Serving a non-swap page fault. */
>  	TVA_KHUGEPAGED,		/* Khugepaged collapse. */
>  	TVA_FORCED_COLLAPSE,	/* Forced collapse (e.g. MADV_COLLAPSE). */
> +	TVA_SWAP,		/* Serving a swap */

Serving a swap what? :) I think TVA_SWAP_PAGEFAULT would be better here right?
And 'serving a swap page fault'.

>  };
>
>  #define thp_vma_allowable_order(vma, vm_flags, type, order) \
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd7418..523153d21a41 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -103,7 +103,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>  					 unsigned long orders)
>  {
>  	const bool smaps = type == TVA_SMAPS;
> -	const bool in_pf = type == TVA_PAGEFAULT;
> +	const bool in_pf = (type == TVA_PAGEFAULT || type == TVA_SWAP);
>  	const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
>  	unsigned long supported_orders;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d9de6c056179..d8819cac7930 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4515,7 +4515,7 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>  	 * and suitable for swapping THP.
>  	 */
> -	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_PAGEFAULT,
> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags, TVA_SWAP,
>  					  BIT(PMD_ORDER) - 1);
>  	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
>  	orders = thp_swap_suitable_orders(swp_offset(entry),
> --
> 2.47.3
>



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-11 14:42       ` Lance Yang
@ 2025-09-11 14:58         ` Lorenzo Stoakes
  2025-09-12  7:58           ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 14:58 UTC (permalink / raw)
  To: Lance Yang
  Cc: Yafang Shao, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 11, 2025 at 10:42:26PM +0800, Lance Yang wrote:
>
>
> On 2025/9/11 22:02, Lorenzo Stoakes wrote:
> > On Wed, Sep 10, 2025 at 08:42:37PM +0800, Lance Yang wrote:
> > > Hey Yafang,
> > >
> > > On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > > > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > > > programs to influence THP order selection based on factors such as:
> > > > - Workload identity
> > > >    For example, workloads running in specific containers or cgroups.
> > > > - Allocation context
> > > >    Whether the allocation occurs during a page fault, khugepaged, swap or
> > > >    other paths.
> > > > - VMA's memory advice settings
> > > >    MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > > > - Memory pressure
> > > >    PSI system data or associated cgroup PSI metrics
> > > >
> > > > The kernel API of this new BPF hook is as follows,
> > > >
> > > > /**
> > > >   * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > > >   * @vma: vm_area_struct associated with the THP allocation
> > > >   * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > > >   *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > > >   *            neither is set.
> > > >   * @tva_type: TVA type for current @vma
> > > >   * @orders: Bitmask of requested THP orders for this allocation
> > > >   *          - PMD-mapped allocation if PMD_ORDER is set
> > > >   *          - mTHP allocation otherwise
> > > >   *
> > > >   * Return: The suggested THP order from the BPF program for allocation. It will
> > > >   *         not exceed the highest requested order in @orders. Return -1 to
> > > >   *         indicate that the original requested @orders should remain unchanged.
> > > >   */
> > > > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > > >                             enum bpf_thp_vma_type vma_type,
> > > >                             enum tva_type tva_type,
> > > >                             unsigned long orders);
> > > >
> > > > Only a single BPF program can be attached at any given time, though it can
> > > > be dynamically updated to adjust the policy. The implementation supports
> > > > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > > > file-backed THP.
> > > >
> > > > This functionality is only active when system-wide THP is configured to
> > > > madvise or always mode. It remains disabled in never mode. Additionally,
> > > > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > > > functionality will also be unavailable for that task.
> > > >
> > > > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > > > enabled. Note that this capability is currently unstable and may undergo
> > > > significant changes—including potential removal—in future kernel versions.
> > > >
> > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > ---
> > > [...]
> > > > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > > > new file mode 100644
> > > > index 000000000000..525ee22ab598
> > > > --- /dev/null
> > > > +++ b/mm/huge_memory_bpf.c
> > > > @@ -0,0 +1,243 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * BPF-based THP policy management
> > > > + *
> > > > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > > > + */
> > > > +
> > > > +#include <linux/bpf.h>
> > > > +#include <linux/btf.h>
> > > > +#include <linux/huge_mm.h>
> > > > +#include <linux/khugepaged.h>
> > > > +
> > > > +enum bpf_thp_vma_type {
> > > > +       BPF_THP_VM_NONE = 0,
> > > > +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > > > +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > > > +};
> > > > +
> > > > +/**
> > > > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > > > + * @vma: vm_area_struct associated with the THP allocation
> > > > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > > > + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > > > + *            neither is set.
> > > > + * @tva_type: TVA type for current @vma
> > > > + * @orders: Bitmask of requested THP orders for this allocation
> > > > + *          - PMD-mapped allocation if PMD_ORDER is set
> > > > + *          - mTHP allocation otherwise
> > > > + *
> > > > + * Return: The suggested THP order from the BPF program for allocation. It will
> > > > + *         not exceed the highest requested order in @orders. Return -1 to
> > > > + *         indicate that the original requested @orders should remain unchanged.
> > >
> > > A minor documentation nit: the comment says "Return -1 to indicate that the
> > > original requested @orders should remain unchanged". It might be slightly
> > > clearer to say "Return a negative value to fall back to the original
> > > behavior". This would cover all error codes as well ;)
> > >
> > > > + */
> > > > +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > > > +                          enum bpf_thp_vma_type vma_type,
> > > > +                          enum tva_type tva_type,
> > > > +                          unsigned long orders);
> > >
> > > Sorry if I'm missing some context here since I haven't tracked the whole
> > > series closely.
> > >
> > > Regarding the return value for thp_order_fn_t: right now it returns a
> > > single int order. I was thinking, what if we let it return an unsigned
> > > long bitmask of orders instead? This seems like it would be more flexible
> > > down the road, especially if we get more mTHP sizes to choose from. It
> > > would also make the API more consistent, as bpf_hook_thp_get_orders()
> > > itself returns an unsigned long ;)
> >
> > I think that adds confusion - as in how an order might be chosen from
> > those. Also we have _received_ a bitmap of available orders - and the intent
> > here is to select _which one we should use_.
>
> Yep. Makes sense to me ;)

Thanks :)

>
> >
> > And this is an experimental feature, behind a flag explicitly labelled as
> > experimental (and thus subject to change) so if we found we needed to change
> > things in the future we can.
>
> You're right, I didn't pay enough attention to the fact that this is
> an experimental feature. So my suggestions were based on a lack of
> context ...

It's fine, don't worry :) these are sensible suggestions - to me it highlights
that perhaps we haven't been clear enough.

>
> >
> > >
> > > Also, for future extensions, it might be a good idea to add a reserved
> > > flags argument to the thp_order_fn_t signature.
> >
> > We don't need to do anything like this, as we are behind an experimental flag
> > and in no way guarantee that this will be used this way going forwards.
> > >
> > > For example thp_order_fn_t(..., unsigned long flags).
> > >
> > > This would give us a forward-compatible way to add new semantics later
> > > without breaking the ABI and needing a v2. We could just require it to be
> > > 0 for now.
> >
> > There is no ABI.
> >
> > I mean again to emphasise, this is an _experimental_ feature not to be relied
> > upon in production.
> >
> > >
> > > Thanks for the great work!
> > > Lance
> >
> > Perhaps we need to put a 'EXPERIMENTAL_' prefix on the config flag too to really
> > bring this home, as it's perhaps not all that clear :)
>
> No need for a 'EXPERIMENTAL_' prefix, it was just me missing
> the background. Appreciate you clarifying this!

Don't worry about it, but also it suggests that we probably need to be
ultra-super clear to users in general. So I think an _EXPERIMENTAL suffix is
probably pretty valid here just to _hammer home_ that - hey - we might break
you! :)

>
> Cheers,
> Lance
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged
  2025-09-10  2:44 ` [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
@ 2025-09-11 15:53   ` Lance Yang
  2025-09-12  6:21     ` Yafang Shao
  2025-09-11 15:58   ` Lorenzo Stoakes
  1 sibling, 1 reply; 61+ messages in thread
From: Lance Yang @ 2025-09-11 15:53 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 11:00 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> Currently, THP allocation cannot be restricted to khugepaged alone while
> being disabled in the page fault path. This limitation exists because
> disabling THP allocation during page faults also prevents the execution of
> khugepaged_enter_vma() in that path.
>
> With the introduction of BPF, we can now implement THP policies based on
> different TVA types. This patch adjusts the logic to support this new
> capability.
>
> While we could also extend prtcl() to utilize this new policy, such a
> change would require a uAPI modification.
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/huge_memory.c |  1 -
>  mm/memory.c      | 13 ++++++++-----
>  2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 523153d21a41..1e9e7b32e2cf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1346,7 +1346,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>         ret = vmf_anon_prepare(vmf);
>         if (ret)
>                 return ret;
> -       khugepaged_enter_vma(vma, vma->vm_flags);
>
>         if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>                         !mm_forbids_zeropage(vma->vm_mm) &&
> diff --git a/mm/memory.c b/mm/memory.c
> index d8819cac7930..d0609dc1e371 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6289,11 +6289,14 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>         if (pud_trans_unstable(vmf.pud))
>                 goto retry_pud;
>
> -       if (pmd_none(*vmf.pmd) &&
> -           thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
> -               ret = create_huge_pmd(&vmf);
> -               if (!(ret & VM_FAULT_FALLBACK))
> -                       return ret;
> +       if (pmd_none(*vmf.pmd)) {
> +               if (vma_is_anonymous(vma))
> +                       khugepaged_enter_vma(vma, vm_flags);

Hmm... I'm a bit confused about the different conditions for calling
khugepaged_enter_vma(). It's sometimes called for anonymous VMAs, other
times ONLY for non-anonymous, and sometimes unconditionally ;)

Anyway, this isn't a blocker, just something I noticed. I might try to
simplify that down the road.

Acked-by: Lance Yang <lance.yang@linux.dev>

Cheers,
Lance

> +               if (thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
> +                       ret = create_huge_pmd(&vmf);
> +                       if (!(ret & VM_FAULT_FALLBACK))
> +                               return ret;
> +               }
>         } else {
>                 vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
>
> --
> 2.47.3
>
>


* Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged
  2025-09-10  2:44 ` [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
  2025-09-11 15:53   ` Lance Yang
@ 2025-09-11 15:58   ` Lorenzo Stoakes
  2025-09-12  6:17     ` Yafang Shao
  1 sibling, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 15:58 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 10:44:41AM +0800, Yafang Shao wrote:
> Currently, THP allocation cannot be restricted to khugepaged alone while
> being disabled in the page fault path. This limitation exists because
> disabling THP allocation during page faults also prevents the execution of
> khugepaged_enter_vma() in that path.

This is quite confusing, I see what you mean - you want to be able to disable
page fault THP but not khugepaged THP _at the point of possibly faulting in a
THP aligned VMA_.

It seems this patch makes khugepaged_enter_vma() unconditional for an anonymous
VMA, rather than depending on the return value specified by
thp_vma_allowable_order().

So I think a clearer explanation is:

	khugepaged_enter_vma() ultimately invokes any attached BPF function with
	the TVA_KHUGEPAGED flag set when determining whether or not to enable
	khugepaged THP for a freshly faulted in VMA.

	Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as
	invoked by create_huge_pmd() and only when we have already checked to
	see if an allowable TVA_PAGEFAULT order is specified.

	Since we might want to disallow THP on fault-in but allow it via
	khugepaged, we move things around so we always attempt to enter
	khugepaged upon fault.

Having said all this, I'm very confused.

Why are we doing this?

We only enable khugepaged _early_ when we know we're faulting in a huge PMD
here.

I guess we do this because, if we are allowed to do the pagefault, maybe
something changed that might previously have prevented khugepaged from running
for the mm.

But now we're just checking unconditionally for... no reason?

If BPF disables page fault but not khugepaged, then surely the mm would
already be under khugepaged if it could be?

It's sort of immaterial if we get a pmd_none() that is not-faultable for
whatever reason but BPF might say is khugepaged'able, because it'd have already
set this.

This is because if we just map a new VMA, we already let khugepaged have it via
khugepaged_enter_vma() in __mmap_new_vma() and in the merge paths.

I mean maybe I'm missing something here :)

>
> With the introduction of BPF, we can now implement THP policies based on
> different TVA types. This patch adjusts the logic to support this new
> capability.
>
> While we could also extend prtcl() to utilize this new policy, such a

Typo: prtcl -> prctl

> change would require a uAPI modification.

Hm, in what respect? PR_SET_THP_DISABLE?

>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/huge_memory.c |  1 -
>  mm/memory.c      | 13 ++++++++-----
>  2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 523153d21a41..1e9e7b32e2cf 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1346,7 +1346,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
>  	ret = vmf_anon_prepare(vmf);
>  	if (ret)
>  		return ret;
> -	khugepaged_enter_vma(vma, vma->vm_flags);
>
>  	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
>  			!mm_forbids_zeropage(vma->vm_mm) &&
> diff --git a/mm/memory.c b/mm/memory.c
> index d8819cac7930..d0609dc1e371 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -6289,11 +6289,14 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  	if (pud_trans_unstable(vmf.pud))
>  		goto retry_pud;
>
> -	if (pmd_none(*vmf.pmd) &&
> -	    thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
> -		ret = create_huge_pmd(&vmf);
> -		if (!(ret & VM_FAULT_FALLBACK))
> -			return ret;
> +	if (pmd_none(*vmf.pmd)) {
> +		if (vma_is_anonymous(vma))
> +			khugepaged_enter_vma(vma, vm_flags);
> +		if (thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
> +			ret = create_huge_pmd(&vmf);
> +			if (!(ret & VM_FAULT_FALLBACK))
> +				return ret;
> +		}
>  	} else {
>  		vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
>
> --
> 2.47.3
>


* Re: [PATCH v7 mm-new 05/10] bpf: mark mm->owner as __safe_rcu_or_null
  2025-09-10  2:44 ` [PATCH v7 mm-new 05/10] bpf: mark mm->owner as __safe_rcu_or_null Yafang Shao
@ 2025-09-11 16:04   ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 16:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 10:44:42AM +0800, Yafang Shao wrote:
> When CONFIG_MEMCG is enabled, we can access mm->owner under RCU. The
> owner can be NULL. With this change, BPF helpers can safely access
> mm->owner to retrieve the associated task from the mm. We can then make
> policy decision based on the task attribute.
>
> The typical use case is as follows,
>
>   bpf_rcu_read_lock(); // rcu lock must be held for rcu trusted field
>   @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
>   if (!@owner)
>       goto out;
>
>   /* Do something based on the task attribute */
>
> out:
>   bpf_rcu_read_unlock();
>
> Suggested-by: Andrii Nakryiko <andrii@kernel.org>

This is one for the BPF people, but this seems reasonable afaict so:

Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  kernel/bpf/verifier.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index c4f69a9e9af6..d400e18ee31e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7123,6 +7123,9 @@ BTF_TYPE_SAFE_RCU(struct cgroup_subsys_state) {
>  /* RCU trusted: these fields are trusted in RCU CS and can be NULL */
>  BTF_TYPE_SAFE_RCU_OR_NULL(struct mm_struct) {
>  	struct file __rcu *exe_file;
> +#ifdef CONFIG_MEMCG
> +	struct task_struct __rcu *owner;
> +#endif
>  };
>
>  /* skb->sk, req->sk are not RCU protected, but we mark them as such
> --
> 2.47.3
>


* Re: [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-09-10  2:44 ` [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
@ 2025-09-11 17:08   ` Lorenzo Stoakes
  2025-09-11 17:30   ` Liam R. Howlett
  1 sibling, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 17:08 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 10:44:43AM +0800, Yafang Shao wrote:
> The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
> we can mark it as trusted_or_null. With this change, BPF helpers can safely
> access vma->vm_mm to retrieve the associated mm_struct from the VMA.
> Then we can make policy decision from the VMA.
>
> The lsm selftest must be modified because it directly accesses vma->vm_mm
> without a NULL pointer check; otherwise it will break due to this
> change.
>
> For the VMA based THP policy, the use case is as follows,
>
>   @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
>   if (!@mm)
>       return;
>   bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
>   @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
>   if (!@owner)
>     goto out;
>   @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
>
>   /* make the decision based on the @cgroup1 attribute */
>
>   bpf_cgroup_release(@cgroup1); // release the associated cgroup
> out:
>   bpf_rcu_read_unlock();
>
> PSI memory information can be obtained from the associated cgroup to inform
> policy decisions. Since upstream PSI support is currently limited to cgroup
> v2, the following example demonstrates cgroup v2 implementation:
>
>   @owner = @mm->owner;
>   if (@owner) {
>       // @ancestor_cgid is user-configured
>       @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
>       if (bpf_task_under_cgroup(@owner, @ancestor)) {
>           @psi_group = @ancestor->psi;
>
>         /* Extract PSI metrics from @psi_group and
>          * implement policy logic based on the values
>          */
>
>       }
>   }
>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

Again more for BPF guys, but seems sensible, so:

Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  kernel/bpf/verifier.c                   | 5 +++++
>  tools/testing/selftests/bpf/progs/lsm.c | 8 +++++---
>  2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d400e18ee31e..b708b98f796c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7165,6 +7165,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
>  	struct sock *sk;
>  };
>
> +BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
> +	struct mm_struct *vm_mm;
> +};
> +
>  static bool type_is_rcu(struct bpf_verifier_env *env,
>  			struct bpf_reg_state *reg,
>  			const char *field_name, u32 btf_id)
> @@ -7206,6 +7210,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
>  {
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
> +	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
>
>  	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
>  					  "__safe_trusted_or_null");
> diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
> index 0c13b7409947..7de173daf27b 100644
> --- a/tools/testing/selftests/bpf/progs/lsm.c
> +++ b/tools/testing/selftests/bpf/progs/lsm.c
> @@ -89,14 +89,16 @@ SEC("lsm/file_mprotect")
>  int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
>  	     unsigned long reqprot, unsigned long prot, int ret)
>  {
> -	if (ret != 0)
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if (ret != 0 || !mm)
>  		return ret;
>
>  	__s32 pid = bpf_get_current_pid_tgid() >> 32;
>  	int is_stack = 0;
>
> -	is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
> -		    vma->vm_end >= vma->vm_mm->start_stack);
> +	is_stack = (vma->vm_start <= mm->start_stack &&
> +		    vma->vm_end >= mm->start_stack);
>
>  	if (is_stack && monitored_pid == pid) {
>  		mprotect_count++;
> --
> 2.47.3
>


* Re: [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-09-10  2:44 ` [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
  2025-09-11 17:08   ` Lorenzo Stoakes
@ 2025-09-11 17:30   ` Liam R. Howlett
  2025-09-11 17:44     ` Lorenzo Stoakes
  2025-09-12  3:50     ` Yafang Shao
  1 sibling, 2 replies; 61+ messages in thread
From: Liam R. Howlett @ 2025-09-11 17:30 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

* Yafang Shao <laoar.shao@gmail.com> [250909 22:46]:
> The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
> we can mark it as trusted_or_null. With this change, BPF helpers can safely
> access vma->vm_mm to retrieve the associated mm_struct from the VMA.
> Then we can make policy decision from the VMA.

I don't agree with any of that statement.

How are you getting a vma outside an rcu lock safely?

vmas are RCU type safe so I don't think you can make the statement of
null or trusted.  You can get a vma that has moved to another mm if you
are not careful.

What am I missing?  Surely there is more context to add to this commit
message.

> 
> The lsm selftest must be modified because it directly accesses vma->vm_mm
> without a NULL pointer check; otherwise it will break due to this
> change.
> 
> For the VMA based THP policy, the use case is as follows,
> 
>   @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
>   if (!@mm)
>       return;
>   bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
>   @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
>   if (!@owner)
>     goto out;
>   @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
> 
>   /* make the decision based on the @cgroup1 attribute */
> 
>   bpf_cgroup_release(@cgroup1); // release the associated cgroup
> out:
>   bpf_rcu_read_unlock();
> 
> PSI memory information can be obtained from the associated cgroup to inform
> policy decisions. Since upstream PSI support is currently limited to cgroup
> v2, the following example demonstrates cgroup v2 implementation:
> 
>   @owner = @mm->owner;
>   if (@owner) {
>       // @ancestor_cgid is user-configured
>       @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
>       if (bpf_task_under_cgroup(@owner, @ancestor)) {
>           @psi_group = @ancestor->psi;
> 
>         /* Extract PSI metrics from @psi_group and
>          * implement policy logic based on the values
>          */
> 
>       }
>   }
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  kernel/bpf/verifier.c                   | 5 +++++
>  tools/testing/selftests/bpf/progs/lsm.c | 8 +++++---
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index d400e18ee31e..b708b98f796c 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -7165,6 +7165,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
>  	struct sock *sk;
>  };
>  
> +BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
> +	struct mm_struct *vm_mm;
> +};
> +
>  static bool type_is_rcu(struct bpf_verifier_env *env,
>  			struct bpf_reg_state *reg,
>  			const char *field_name, u32 btf_id)
> @@ -7206,6 +7210,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
>  {
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
>  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
> +	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
>  
>  	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
>  					  "__safe_trusted_or_null");
> diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
> index 0c13b7409947..7de173daf27b 100644
> --- a/tools/testing/selftests/bpf/progs/lsm.c
> +++ b/tools/testing/selftests/bpf/progs/lsm.c
> @@ -89,14 +89,16 @@ SEC("lsm/file_mprotect")
>  int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
>  	     unsigned long reqprot, unsigned long prot, int ret)
>  {
> -	if (ret != 0)
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if (ret != 0 || !mm)
>  		return ret;
>  
>  	__s32 pid = bpf_get_current_pid_tgid() >> 32;
>  	int is_stack = 0;
>  
> -	is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
> -		    vma->vm_end >= vma->vm_mm->start_stack);
> +	is_stack = (vma->vm_start <= mm->start_stack &&
> +		    vma->vm_end >= mm->start_stack);
>  
>  	if (is_stack && monitored_pid == pid) {
>  		mprotect_count++;
> -- 
> 2.47.3
> 
> 


* Re: [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-09-11 17:30   ` Liam R. Howlett
@ 2025-09-11 17:44     ` Lorenzo Stoakes
  2025-09-12  3:56       ` Yafang Shao
  2025-09-12  3:50     ` Yafang Shao
  1 sibling, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-11 17:44 UTC (permalink / raw)
  To: Liam R. Howlett, Yafang Shao, akpm, david, ziy, baolin.wang,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 11, 2025 at 01:30:52PM -0400, Liam R. Howlett wrote:
> * Yafang Shao <laoar.shao@gmail.com> [250909 22:46]:
> > The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
> > we can mark it as trusted_or_null. With this change, BPF helpers can safely
> > access vma->vm_mm to retrieve the associated mm_struct from the VMA.
> > Then we can make policy decision from the VMA.
>
> I don't agree with any of that statement.
>
> How are you getting a vma outside an rcu lock safely?

I'm guessing he means that kernel code might access it outside of RCU?

vma->vm_mm can be NULL for 'special' mappings, no not that special, not the
other special, the VDSO special, yeah that one.

get_vma_name() in fs/proc/task_mmu.c does:

	if (!vma->vm_mm) {
		*name = "[vdso]";
		return;
	}

Not sure you'd ever find a way to bump into that in THP code paths though ofc.

I was reassured in the last version of the series that the MM is definitely
safe to access.

E.g. https://lore.kernel.org/linux-mm/299e12dc-259b-45c2-8662-2f3863479939@lucifer.local/
https://lore.kernel.org/linux-mm/5fb8bd8d-cdd9-42e0-b62d-eb5a517a35c2@lucifer.local/

And it _seems_ BPF can already access VMA's.

I think everything's under RCU, and there's automatically an RCU lock applied
for anything BPF-ish.

So my A-b was all based on this kind of hand waving...

>
> vmas are RCU type safe so I don't think you can make the statement of
> null or trusted.  You can get a vma that has moved to another mm if you
> are not careful.
>
> What am I missing?  Surely there is more context to add to this commit
> message.

Suspect it's the BPF-magic that's the confusing bit...

>
> >
> > The lsm selftest must be modified because it directly accesses vma->vm_mm
> > without a NULL pointer check; otherwise it will break due to this
> > change.
> >
> > For the VMA based THP policy, the use case is as follows,
> >
> >   @mm = @vma->vm_mm; // vm_area_struct::vm_mm is trusted or null
> >   if (!@mm)
> >       return;
> >   bpf_rcu_read_lock(); // rcu lock must be held to dereference the owner
> >   @owner = @mm->owner; // mm_struct::owner is rcu trusted or null
> >   if (!@owner)
> >     goto out;
> >   @cgroup1 = bpf_task_get_cgroup1(@owner, MEMCG_HIERARCHY_ID);
> >
> >   /* make the decision based on the @cgroup1 attribute */
> >
> >   bpf_cgroup_release(@cgroup1); // release the associated cgroup
> > out:
> >   bpf_rcu_read_unlock();
> >
> > PSI memory information can be obtained from the associated cgroup to inform
> > policy decisions. Since upstream PSI support is currently limited to cgroup
> > v2, the following example demonstrates cgroup v2 implementation:
> >
> >   @owner = @mm->owner;
> >   if (@owner) {
> >       // @ancestor_cgid is user-configured
> >       @ancestor = bpf_cgroup_from_id(@ancestor_cgid);
> >       if (bpf_task_under_cgroup(@owner, @ancestor)) {
> >           @psi_group = @ancestor->psi;
> >
> >         /* Extract PSI metrics from @psi_group and
> >          * implement policy logic based on the values
> >          */
> >
> >       }
> >   }
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  kernel/bpf/verifier.c                   | 5 +++++
> >  tools/testing/selftests/bpf/progs/lsm.c | 8 +++++---
> >  2 files changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index d400e18ee31e..b708b98f796c 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -7165,6 +7165,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
> >  	struct sock *sk;
> >  };
> >
> > +BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
> > +	struct mm_struct *vm_mm;
> > +};
> > +
> >  static bool type_is_rcu(struct bpf_verifier_env *env,
> >  			struct bpf_reg_state *reg,
> >  			const char *field_name, u32 btf_id)
> > @@ -7206,6 +7210,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
> >  {
> >  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
> >  	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
> > +	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
> >
> >  	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
> >  					  "__safe_trusted_or_null");
> > diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c
> > index 0c13b7409947..7de173daf27b 100644
> > --- a/tools/testing/selftests/bpf/progs/lsm.c
> > +++ b/tools/testing/selftests/bpf/progs/lsm.c
> > @@ -89,14 +89,16 @@ SEC("lsm/file_mprotect")
> >  int BPF_PROG(test_int_hook, struct vm_area_struct *vma,
> >  	     unsigned long reqprot, unsigned long prot, int ret)
> >  {
> > -	if (ret != 0)
> > +	struct mm_struct *mm = vma->vm_mm;
> > +
> > +	if (ret != 0 || !mm)
> >  		return ret;
> >
> >  	__s32 pid = bpf_get_current_pid_tgid() >> 32;
> >  	int is_stack = 0;
> >
> > -	is_stack = (vma->vm_start <= vma->vm_mm->start_stack &&
> > -		    vma->vm_end >= vma->vm_mm->start_stack);
> > +	is_stack = (vma->vm_start <= mm->start_stack &&
> > +		    vma->vm_end >= mm->start_stack);
> >
> >  	if (is_stack && monitored_pid == pid) {
> >  		mprotect_count++;
> > --
> > 2.47.3
> >
> >


* Re: [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-09-11 17:30   ` Liam R. Howlett
  2025-09-11 17:44     ` Lorenzo Stoakes
@ 2025-09-12  3:50     ` Yafang Shao
  1 sibling, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-12  3:50 UTC (permalink / raw)
  To: Liam R. Howlett, Yafang Shao, akpm, david, ziy, baolin.wang,
	lorenzo.stoakes, npache, ryan.roberts, dev.jain, hannes,
	usamaarif642, gutierrez.asier, willy, ast, daniel, andrii,
	ameryhung, rientjes, corbet, 21cnbao, shakeel.butt, bpf, linux-mm,
	linux-doc

On Fri, Sep 12, 2025 at 1:31 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Yafang Shao <laoar.shao@gmail.com> [250909 22:46]:
> > The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
> > we can mark it as trusted_or_null. With this change, BPF helpers can safely
> > access vma->vm_mm to retrieve the associated mm_struct from the VMA.
> > Then we can make policy decision from the VMA.
>
> I don't agree with any of that statement.
>
> How are you getting a vma outside an rcu lock safely?

The callers of this BPF hook guarantee that the provided
vm_area_struct pointer is safe to read. This means your BPF program
can safely access its members, though it cannot write to them.

You might question how code in lsm.c can access
vma->vm_mm->start_stack without an explicit NULL check. This is
because the BPF verifier has a safety feature: if vma->vm_mm is NULL,
it will substitute the value 0 instead of actually performing the
dereference, preventing a crash.

However, while this prevents a kernel panic, it doesn't guarantee
correct logic. If your program uses the value 0 for start_stack
without knowing it came from a NULL pointer, it might behave
incorrectly. Therefore, you must still explicitly check for NULL to
ensure your program's logic is sound.

The __safe_trusted_or_null marker enforces this requirement. It is a
restriction that ensures program correctness, not a loosening of the
rules.

Alex, Andrii, please correct me if my understanding is wrong.

>
> vmas are RCU type safe so I don't think you can make the statement of
> null or trusted.  You can get a vma that has moved to another mm if you
> are not careful.
>
> What am I missing?  Surely there is more context to add to this commit
> message.

According to the definition of struct vm_area_struct, the comment on
vm_mm states: "Unstable RCU readers are allowed to read this." This
confirms that we can safely read vm_mm without holding the RCU read
lock. If this were not the case, the comment would need to be
corrected.

  struct vm_area_struct {
         /*
         * The address space we belong to.
         * Unstable RCU readers are allowed to read this.
         */
         struct mm_struct *vm_mm;
  };

As a minor, unrelated note: Non-sleepable BPF programs always run
within an RCU read-side critical section. Therefore, you do not need
to explicitly acquire the RCU read lock in such programs.

-- 
Regards
Yafang


* Re: [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null
  2025-09-11 17:44     ` Lorenzo Stoakes
@ 2025-09-12  3:56       ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-12  3:56 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Liam R. Howlett, akpm, david, ziy, baolin.wang, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Fri, Sep 12, 2025 at 1:44 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Sep 11, 2025 at 01:30:52PM -0400, Liam R. Howlett wrote:
> > * Yafang Shao <laoar.shao@gmail.com> [250909 22:46]:
> > > The vma->vm_mm might be NULL and it can be accessed outside of RCU. Thus,
> > > we can mark it as trusted_or_null. With this change, BPF helpers can safely
> > > access vma->vm_mm to retrieve the associated mm_struct from the VMA.
> > > Then we can make policy decision from the VMA.
> >
> > I don't agree with any of that statement.
> >
> > How are you getting a vma outside an rcu lock safely?
>
> I'm guessing he means that kernel code might access it outside of RCU?
>
> vma->vm_mm can be NULL for 'special' mappings, no not that special, not the
> other special, the VDSO special, yeah that one.
>
> get_vma_name() in fs/proc/task_mmu.c does:
>
>         if (!vma->vm_mm) {
>                 *name = "[vdso]";
>                 return;
>         }
>
> Not sure you'd ever find a way to bump into that in THP code paths though ofc.
>
> I was reassured in the last version of the series that the MM is definitely safe
> to access safe to access
>
> E.g. https://lore.kernel.org/linux-mm/299e12dc-259b-45c2-8662-2f3863479939@lucifer.local/
> https://lore.kernel.org/linux-mm/5fb8bd8d-cdd9-42e0-b62d-eb5a517a35c2@lucifer.local/
>
> And it _seems_ BPF can already access VMA's.
>
> I think everything's under RCU, and there's automatically an RCU lock applied
> for anything BPF-ish.

This is true for non-sleepable BPF programs. However, for sleepable
BPF programs, only SRCU protection is held.

>
> So my A-b was all baed on this kind of hand waving...
>
> >
> > vmas are RCU type safe so I don't think you can make the statement of
> > null or trusted.  You can get a vma that has moved to another mm if you
> > are not careful.
> >
> > What am I missing?  Surely there is more context to add to this commit
> > message.
>
> Suspect it's the BPF-magic that's the confusing bit...

Absolutely. BPF has a lot of magic under the hood ;-)

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged
  2025-09-11 15:58   ` Lorenzo Stoakes
@ 2025-09-12  6:17     ` Yafang Shao
  2025-09-12 13:48       ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-12  6:17 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 11, 2025 at 11:58 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Sep 10, 2025 at 10:44:41AM +0800, Yafang Shao wrote:
> > Currently, THP allocation cannot be restricted to khugepaged alone while
> > being disabled in the page fault path. This limitation exists because
> > disabling THP allocation during page faults also prevents the execution of
> > khugepaged_enter_vma() in that path.
>
> This is quite confusing, I see what you mean - you want to be able to disable
> page fault THP but not khugepaged THP _at the point of possibly faulting in a
> THP aligned VMA_.
>
> It seems this patch makes khugepaged_enter_vma() unconditional for an anonymous
> VMA, rather than depending on the return value specified by
> thp_vma_allowable_order().

The functions thp_vma_allowable_order(TVA_PAGEFAULT) and
thp_vma_allowable_order(TVA_KHUGEPAGED) are functionally equivalent
within the page fault handler; they always yield the same result.
Consequently, their execution order is irrelevant.

The change reorders these two calls and, in doing so, also moves the
call to vmf_anon_prepare(vmf). This alters the control flow:

- Before this change: the logic checked the return value of
  vmf_anon_prepare() between the two thp_vma_allowable_order() calls.

    thp_vma_allowable_order(TVA_PAGEFAULT);
    ret = vmf_anon_prepare(vmf);
    if (ret)
        return ret;
    thp_vma_allowable_order(TVA_KHUGEPAGED);

- After this change: the logic executes both thp_vma_allowable_order()
  calls first and does not check the return value of vmf_anon_prepare()
  between them.

    thp_vma_allowable_order(TVA_KHUGEPAGED);
    thp_vma_allowable_order(TVA_PAGEFAULT);
    ret = vmf_anon_prepare(vmf); // Return value 'ret' is ignored.

This change is safe because the return value of vmf_anon_prepare() can
be ignored at this point. This function checks for transient
system-level conditions (e.g., memory pressure, THP availability) that
might prevent an immediate THP allocation. It does not guarantee that
a subsequent allocation will succeed.

This behavior is consistent with the policy in hugepage_madvise(),
where a VMA is queued for khugepaged before a definitive allocation
check. If the system is under pressure, khugepaged will simply retry
the allocation at a more opportune time.

>
> So I think a clearer explanation is:
>
>         khugepaged_enter_vma() ultimately invokes any attached BPF function with
>         the TVA_KHUGEPAGED flag set when determining whether or not to enable
>         khugepaged THP for a freshly faulted in VMA.
>
>         Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as
>         invoked by create_huge_pmd() and only when we have already checked to
>         see if an allowable TVA_PAGEFAULT order is specified.
>
>         Since we might want to disallow THP on fault-in but allow it via
>         khugepaged, we move things around so we always attempt to enter
>         khugepaged upon fault.

Thanks for the clarification.

>
> Having said all this, I'm very confused.
>
> Why are we doing this?
>
> We only enable khugepaged _early_ when we know we're faulting in a huge PMD
> here.
>
> I guess we do this because, if we are allowed to do the pagefault, maybe
> something changed that might have previously disallowed khugepaged to run for
> the mm.
>
> But now we're just checking unconditionally for... no reason?

I have gone through the git blame history of
do_huge_pmd_anonymous_page() but was unable to find any rationale for
placing khugepaged_enter_vma() after the vmf_anon_prepare() check. I
therefore believe this ordering is likely unintentional.

>
> if BPF disables page fault but not khugepaged, then surely the mm would already
> be under be khugepaged if it could be?

The behavior you describe applies to the madvise mode, not the always
mode. To reiterate: hugepage_madvise() unconditionally adds the mm to
the khugepaged queue, whereas the page fault handler employs
conditional logic.

>
> It's sort of immaterial if we get a pmd_none() that is not-faultable for
> whatever reason but BPF might say is khugepaged'able, because it'd have already
> set this.
>
> This is because if we just map a new VMA, we already let khugepaged have it via
> khugepaged_enter_vma() in __mmap_new_vma() and in the merge paths.
>
> I mean maybe I'm missing something here :)
>
> >
> > With the introduction of BPF, we can now implement THP policies based on
> > different TVA types. This patch adjusts the logic to support this new
> > capability.
> >
> > While we could also extend prtcl() to utilize this new policy, such a
>
> Typo: prtcl -> prctl

thanks

>
> > change would require a uAPI modification.
>
> Hm, in what respect? PR_SET_THP_DISABLE?

Right, we can extend PR_SET_THP_DISABLE to support this logic as well.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged
  2025-09-11 15:53   ` Lance Yang
@ 2025-09-12  6:21     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-12  6:21 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 11, 2025 at 11:54 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On Wed, Sep 10, 2025 at 11:00 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > Currently, THP allocation cannot be restricted to khugepaged alone while
> > being disabled in the page fault path. This limitation exists because
> > disabling THP allocation during page faults also prevents the execution of
> > khugepaged_enter_vma() in that path.
> >
> > With the introduction of BPF, we can now implement THP policies based on
> > different TVA types. This patch adjusts the logic to support this new
> > capability.
> >
> > While we could also extend prtcl() to utilize this new policy, such a
> > change would require a uAPI modification.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  mm/huge_memory.c |  1 -
> >  mm/memory.c      | 13 ++++++++-----
> >  2 files changed, 8 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 523153d21a41..1e9e7b32e2cf 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1346,7 +1346,6 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >         ret = vmf_anon_prepare(vmf);
> >         if (ret)
> >                 return ret;
> > -       khugepaged_enter_vma(vma, vma->vm_flags);
> >
> >         if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> >                         !mm_forbids_zeropage(vma->vm_mm) &&
> > diff --git a/mm/memory.c b/mm/memory.c
> > index d8819cac7930..d0609dc1e371 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -6289,11 +6289,14 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> >         if (pud_trans_unstable(vmf.pud))
> >                 goto retry_pud;
> >
> > -       if (pmd_none(*vmf.pmd) &&
> > -           thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
> > -               ret = create_huge_pmd(&vmf);
> > -               if (!(ret & VM_FAULT_FALLBACK))
> > -                       return ret;
> > +       if (pmd_none(*vmf.pmd)) {
> > +               if (vma_is_anonymous(vma))
> > +                       khugepaged_enter_vma(vma, vm_flags);
>
> Hmm... I'm a bit confused about the different conditions for calling
> khugepaged_enter_vma(). It's sometimes called for anonymous VMAs, other
> times ONLY for non-anonymous, and sometimes unconditionally ;)

Right, it is really confusing.

>
> Anyway, this isn't a blocker, just something I noticed. I might try to
> simplify that down the road.

Please do, when you have a moment.

>
> Acked-by: Lance Yang <lance.yang@linux.dev>

Thanks for the review.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 03/10] mm: thp: decouple THP allocation between swap and page fault paths
  2025-09-11 14:55   ` Lorenzo Stoakes
@ 2025-09-12  7:20     ` Yafang Shao
  2025-09-12 12:04       ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-12  7:20 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 11, 2025 at 10:56 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Sep 10, 2025 at 10:44:40AM +0800, Yafang Shao wrote:
> > The new BPF capability enables finer-grained THP policy decisions by
> > introducing separate handling for swap faults versus normal page faults.
> >
> > As highlighted by Barry:
> >
> >   We’ve observed that swapping in large folios can lead to more
> >   swap thrashing for some workloads- e.g. kernel build. Consequently,
> >   some workloads might prefer swapping in smaller folios than those
> >   allocated by alloc_anon_folio().
> >
> > While prtcl() could potentially be extended to leverage this new policy,
> > doing so would require modifications to the uAPI.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>
> Other than nits, these seems fine, so:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> > Cc: Barry Song <21cnbao@gmail.com>
> > ---
> >  include/linux/huge_mm.h | 3 ++-
> >  mm/huge_memory.c        | 2 +-
> >  mm/memory.c             | 2 +-
> >  3 files changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index f72a5fd04e4f..b9742453806f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -97,9 +97,10 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> >
> >  enum tva_type {
> >       TVA_SMAPS,              /* Exposing "THPeligible:" in smaps. */
> > -     TVA_PAGEFAULT,          /* Serving a page fault. */
> > +     TVA_PAGEFAULT,          /* Serving a non-swap page fault. */
> >       TVA_KHUGEPAGED,         /* Khugepaged collapse. */
> >       TVA_FORCED_COLLAPSE,    /* Forced collapse (e.g. MADV_COLLAPSE). */
> > +     TVA_SWAP,               /* Serving a swap */
>
> Serving a swap what? :) I think TVA_SWAP_PAGEFAULT would be better here right?
> And 'serving a swap page fault'.

will change it. Thanks for your suggestion.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-11 14:58         ` Lorenzo Stoakes
@ 2025-09-12  7:58           ` Yafang Shao
  2025-09-12 12:04             ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-12  7:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Lance Yang, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 11, 2025 at 10:58 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Thu, Sep 11, 2025 at 10:42:26PM +0800, Lance Yang wrote:
> >
> >
> > On 2025/9/11 22:02, Lorenzo Stoakes wrote:
> > > On Wed, Sep 10, 2025 at 08:42:37PM +0800, Lance Yang wrote:
> > > > Hey Yafang,
> > > >
> > > > On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > > >
> > > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > > > > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > > > > programs to influence THP order selection based on factors such as:
> > > > > - Workload identity
> > > > >    For example, workloads running in specific containers or cgroups.
> > > > > - Allocation context
> > > > >    Whether the allocation occurs during a page fault, khugepaged, swap or
> > > > >    other paths.
> > > > > - VMA's memory advice settings
> > > > >    MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > > > > - Memory pressure
> > > > >    PSI system data or associated cgroup PSI metrics
> > > > >
> > > > > The kernel API of this new BPF hook is as follows,
> > > > >
> > > > > /**
> > > > >   * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > > > >   * @vma: vm_area_struct associated with the THP allocation
> > > > >   * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > > > >   *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > > > >   *            neither is set.
> > > > >   * @tva_type: TVA type for current @vma
> > > > >   * @orders: Bitmask of requested THP orders for this allocation
> > > > >   *          - PMD-mapped allocation if PMD_ORDER is set
> > > > >   *          - mTHP allocation otherwise
> > > > >   *
> > > > >   * Return: The suggested THP order from the BPF program for allocation. It will
> > > > >   *         not exceed the highest requested order in @orders. Return -1 to
> > > > >   *         indicate that the original requested @orders should remain unchanged.
> > > > >   */
> > > > > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > > > >                             enum bpf_thp_vma_type vma_type,
> > > > >                             enum tva_type tva_type,
> > > > >                             unsigned long orders);
> > > > >
> > > > > Only a single BPF program can be attached at any given time, though it can
> > > > > be dynamically updated to adjust the policy. The implementation supports
> > > > > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > > > > file-backed THP.
> > > > >
> > > > > This functionality is only active when system-wide THP is configured to
> > > > > madvise or always mode. It remains disabled in never mode. Additionally,
> > > > > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > > > > functionality will also be unavailable for that task.
> > > > >
> > > > > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > > > > enabled. Note that this capability is currently unstable and may undergo
> > > > > significant changes—including potential removal—in future kernel versions.
> > > > >
> > > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > > ---
> > > > [...]
> > > > > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > > > > new file mode 100644
> > > > > index 000000000000..525ee22ab598
> > > > > --- /dev/null
> > > > > +++ b/mm/huge_memory_bpf.c
> > > > > @@ -0,0 +1,243 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > +/*
> > > > > + * BPF-based THP policy management
> > > > > + *
> > > > > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > > > > + */
> > > > > +
> > > > > +#include <linux/bpf.h>
> > > > > +#include <linux/btf.h>
> > > > > +#include <linux/huge_mm.h>
> > > > > +#include <linux/khugepaged.h>
> > > > > +
> > > > > +enum bpf_thp_vma_type {
> > > > > +       BPF_THP_VM_NONE = 0,
> > > > > +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > > > > +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > > > > +};
> > > > > +
> > > > > +/**
> > > > > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > > > > + * @vma: vm_area_struct associated with the THP allocation
> > > > > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > > > > + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > > > > + *            neither is set.
> > > > > + * @tva_type: TVA type for current @vma
> > > > > + * @orders: Bitmask of requested THP orders for this allocation
> > > > > + *          - PMD-mapped allocation if PMD_ORDER is set
> > > > > + *          - mTHP allocation otherwise
> > > > > + *
> > > > > + * Return: The suggested THP order from the BPF program for allocation. It will
> > > > > + *         not exceed the highest requested order in @orders. Return -1 to
> > > > > + *         indicate that the original requested @orders should remain unchanged.
> > > >
> > > > A minor documentation nit: the comment says "Return -1 to indicate that the
> > > > original requested @orders should remain unchanged". It might be slightly
> > > > clearer to say "Return a negative value to fall back to the original
> > > > behavior". This would cover all error codes as well ;)
> > > >
> > > > > + */
> > > > > +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > > > > +                          enum bpf_thp_vma_type vma_type,
> > > > > +                          enum tva_type tva_type,
> > > > > +                          unsigned long orders);
> > > >
> > > > Sorry if I'm missing some context here since I haven't tracked the whole
> > > > series closely.
> > > >
> > > > Regarding the return value for thp_order_fn_t: right now it returns a
> > > > single int order. I was thinking, what if we let it return an unsigned
> > > > long bitmask of orders instead? This seems like it would be more flexible
> > > > down the road, especially if we get more mTHP sizes to choose from. It
> > > > would also make the API more consistent, as bpf_hook_thp_get_orders()
> > > > itself returns an unsigned long ;)
> > >
> > > I think that adds confusion - as in how an order might be chosen from
> > > those. Also we have _received_ a bitmap of available orders - and the intent
> > > here is to select _which one we should use_.
> >
> > Yep. Makes sense to me ;)
>
> Thanks :)
>
> >
> > >
> > > And this is an experimental feature, behind a flag explicitly labelled as
> > > experimental (and thus subject to change) so if we found we needed to change
> > > things in the future we can.
> >
> > You're right, I didn't pay enough attention to the fact that this is
> > an experimental feature. So my suggestions were based on a lack of
> > context ...
>
> It's fine, don't worry :) these are sensible suggestions - it to me highlights
> that we haven't been clear enough perhaps.
>
> >
> > >
> > > >
> > > > Also, for future extensions, it might be a good idea to add a reserved
> > > > flags argument to the thp_order_fn_t signature.
> > >
> > > We don't need to do anything like this, as we are behind an experimental flag
> > > and in no way guarantee that this will be used this way going forwards.
> > > >
> > > > For example thp_order_fn_t(..., unsigned long flags).
> > > >
> > > > This would give us aforward-compatible way to add new semantics later
> > > > without breaking the ABI and needing a v2. We could just require it to be
> > > > 0 for now.
> > >
> > > There is no ABI.
> > >
> > > I mean again to emphasise, this is an _experimental_ feature not to be relied
> > > upon in production.
> > >
> > > >
> > > > Thanks for the great work!
> > > > Lance
> > >
> > > Perhaps we need to put a 'EXPERIMENTAL_' prefix on the config flag too to really
> > > bring this home, as it's perhaps not all that clear :)
> >
> > No need for a 'EXPERIMENTAL_' prefix, it was just me missing
> > the background. Appreciate you clarifying this!
>
> Don't worry about it, but also it suggests that we probably need to be
> ultra-super clear to users in general. So I think an _EXPERIMENTAL suffix is
> probably pretty valid here just to _hammer home_ that - hey - we might break
> you! :)

I will add it. Thanks for the reminder.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-11 14:51   ` Lorenzo Stoakes
@ 2025-09-12  8:03     ` Yafang Shao
  2025-09-12 12:00       ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-12  8:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 11, 2025 at 10:51 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Sep 10, 2025 at 10:44:39AM +0800, Yafang Shao wrote:
> > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > new file mode 100644
> > index 000000000000..525ee22ab598
> > --- /dev/null
> > +++ b/mm/huge_memory_bpf.c
>
> [snip]
>
> > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > +                                   vm_flags_t vma_flags,
> > +                                   enum tva_type tva_type,
> > +                                   unsigned long orders)
> > +{
> > +     thp_order_fn_t *bpf_hook_thp_get_order;
> > +     unsigned long thp_orders = orders;
> > +     enum bpf_thp_vma_type vma_type;
> > +     int thp_order;
> > +
> > +     /* No BPF program is attached */
> > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                   &transparent_hugepage_flags))
> > +             return orders;
> > +
> > +     if (vma_flags & VM_HUGEPAGE)
> > +             vma_type = BPF_THP_VM_HUGEPAGE;
> > +     else if (vma_flags & VM_NOHUGEPAGE)
> > +             vma_type = BPF_THP_VM_NOHUGEPAGE;
> > +     else
> > +             vma_type = BPF_THP_VM_NONE;
> > +
> > +     rcu_read_lock();
> > +     bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > +     if (!bpf_hook_thp_get_order)
> > +             goto out;
> > +
> > +     thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
> > +     if (thp_order < 0)
> > +             goto out;
> > +     /*
> > +      * The maximum requested order is determined by the callsite. E.g.:
> > +      * - PMD-mapped THP uses PMD_ORDER
> > +      * - mTHP uses (PMD_ORDER - 1)
> > +      *
> > +      * We must respect this upper bound to avoid undefined behavior. So the
> > +      * highest suggested order can't exceed the highest requested order.
> > +      */
> > +     if (thp_order <= highest_order(orders))
> > +             thp_orders = BIT(thp_order);
>
> OK so looking at Lance's reply re: setting 0 and what we're doing here in
> general - this seems a bit weird to me.
>
> Shouldn't orders be specifying a _mask_ as to which orders are _available_,
> rather than allowing a user to specify an arbitrary order?
>
> So if you're a position where the only possible order is PMD sized, now this
> would let you arbitrarily select an mTHP right? That does no seem correct.
>
> And as per Lance, if we cannot satisfy the requested order, we shouldn't fall
> back to available orders, we should take that as a signal that we cannot have
> THP at all.
>
> So shouldn't this just be:
>
>         thp_orders = orders & BIT(thp_order);

That's better.  I will change it.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-11 14:33   ` Lorenzo Stoakes
@ 2025-09-12  8:28     ` Yafang Shao
  2025-09-12 11:53       ` Lorenzo Stoakes
  0 siblings, 1 reply; 61+ messages in thread
From: Yafang Shao @ 2025-09-12  8:28 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 11, 2025 at 10:34 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Sep 10, 2025 at 10:44:39AM +0800, Yafang Shao wrote:
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> >   For example, workloads running in specific containers or cgroups.
> > - Allocation context
> >   Whether the allocation occurs during a page fault, khugepaged, swap or
> >   other paths.
> > - VMA's memory advice settings
> >   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > - Memory pressure
> >   PSI system data or associated cgroup PSI metrics
> >
> > The kernel API of this new BPF hook is as follows,
> >
> > /**
> >  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >  * @vma: vm_area_struct associated with the THP allocation
> >  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> >  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >  *            neither is set.
> >  * @tva_type: TVA type for current @vma
> >  * @orders: Bitmask of requested THP orders for this allocation
> >  *          - PMD-mapped allocation if PMD_ORDER is set
> >  *          - mTHP allocation otherwise
> >  *
> >  * Return: The suggested THP order from the BPF program for allocation. It will
> >  *         not exceed the highest requested order in @orders. Return -1 to
> >  *         indicate that the original requested @orders should remain unchanged.
> >  */
> > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >                          enum bpf_thp_vma_type vma_type,
> >                          enum tva_type tva_type,
> >                          unsigned long orders);
> >
> > Only a single BPF program can be attached at any given time, though it can
> > be dynamically updated to adjust the policy. The implementation supports
> > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > file-backed THP.
> >
> > This functionality is only active when system-wide THP is configured to
> > madvise or always mode. It remains disabled in never mode. Additionally,
> > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > functionality will also be unavailable for that task.
> >
> > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > enabled. Note that this capability is currently unstable and may undergo
> > significant changes—including potential removal—in future kernel versions.
>
> Thanks for highlighting.
>
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  MAINTAINERS             |   1 +
> >  include/linux/huge_mm.h |  26 ++++-
> >  mm/Kconfig              |  12 ++
> >  mm/Makefile             |   1 +
> >  mm/huge_memory_bpf.c    | 243 ++++++++++++++++++++++++++++++++++++++++
> >  5 files changed, 280 insertions(+), 3 deletions(-)
> >  create mode 100644 mm/huge_memory_bpf.c
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 8fef05bc2224..d055a3c95300 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -16252,6 +16252,7 @@ F:    include/linux/huge_mm.h
> >  F:   include/linux/khugepaged.h
> >  F:   include/trace/events/huge_memory.h
> >  F:   mm/huge_memory.c
> > +F:   mm/huge_memory_bpf.c
>
> THanks!
>
> >  F:   mm/khugepaged.c
> >  F:   mm/mm_slot.h
> >  F:   tools/testing/selftests/mm/khugepaged.c
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 23f124493c47..f72a5fd04e4f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
> >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> >  };
> >
> >  struct kobject;
> > @@ -270,6 +271,19 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                        enum tva_type type,
> >                                        unsigned long orders);
> >
> > +#ifdef CONFIG_BPF_GET_THP_ORDER
> > +unsigned long
> > +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> > +                     enum tva_type type, unsigned long orders);
>
> Thanks for renaming!
>
> > +#else
> > +static inline unsigned long
> > +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> > +                     enum tva_type tva_flags, unsigned long orders)
> > +{
> > +     return orders;
> > +}
> > +#endif
> > +
> >  /**
> >   * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
> >   * @vma:  the vm area to check
> > @@ -291,6 +305,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                      enum tva_type type,
> >                                      unsigned long orders)
> >  {
> > +     unsigned long bpf_orders;
> > +
> > +     bpf_orders = bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
> > +     if (!bpf_orders)
> > +             return 0;
>
> I think it'd be easier to just do:
>
>         /* The BPF-specified order overrides which order is selected. */
>         orders &= bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
>         if (!orders)
>                 return 0;

good suggestion!

>
> > +
> >       /*
> >        * Optimization to check if required orders are enabled early. Only
> >        * forced collapse ignores sysfs configs.
> > @@ -304,12 +324,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                   ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
> >                       mask |= READ_ONCE(huge_anon_orders_inherit);
> >
> > -             orders &= mask;
> > -             if (!orders)
> > +             bpf_orders &= mask;
> > +             if (!bpf_orders)
> >                       return 0
>
> With my suggeted change this would remain the same.
>
> >       }
> >
> > -     return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
> > +     return __thp_vma_allowable_orders(vma, vm_flags, type, bpf_orders);
>
> With my suggeted change this would remain the same.
>
> >  }
> >
> >  struct thpsize {
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index d1ed839ca710..4d89d2158f10 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -896,6 +896,18 @@ config NO_PAGE_MAPCOUNT
> >
> >         EXPERIMENTAL because the impact of some changes is still unclear.
> >
> > +config BPF_GET_THP_ORDER
>
> Yeah, I think we maybe need to sledgehammer this as already Lance was confused
> as to the permanence of this, and I feel that users might be too, even with the
> '(EXPERIMENTAL)' bit.
>
> So maybe
>
> config BPF_GET_THP_ORDER_EXPERIMENTAL
>
> Just to hammer it home?

ack
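With the rename applied, the entry might read as follows (a sketch reusing the help text from the patch under review):

```
config BPF_GET_THP_ORDER_EXPERIMENTAL
	bool "BPF-based THP order selection (EXPERIMENTAL)"
	depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
	help
	  Enable dynamic THP order selection using BPF programs. This
	  experimental feature allows custom BPF logic to determine optimal
	  transparent hugepage allocation sizes at runtime.

	  WARNING: This feature is unstable and may change in future kernel
	  versions.
```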

>
> > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > +
> > +     help
> > +       Enable dynamic THP order selection using BPF programs. This
> > +       experimental feature allows custom BPF logic to determine optimal
> > +       transparent hugepage allocation sizes at runtime.
> > +
> > +       WARNING: This feature is unstable and may change in future kernel
> > +       versions.
> > +
> >  endif # TRANSPARENT_HUGEPAGE
> >
> >  # simple helper to make the code a bit easier to read
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 21abb3353550..f180332f2ad0 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> >  obj-$(CONFIG_NUMA) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > +obj-$(CONFIG_BPF_GET_THP_ORDER) += huge_memory_bpf.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > new file mode 100644
> > index 000000000000..525ee22ab598
> > --- /dev/null
> > +++ b/mm/huge_memory_bpf.c
> > @@ -0,0 +1,243 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * BPF-based THP policy management
> > + *
> > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > + */
> > +
> > +#include <linux/bpf.h>
> > +#include <linux/btf.h>
> > +#include <linux/huge_mm.h>
> > +#include <linux/khugepaged.h>
> > +
> > +enum bpf_thp_vma_type {
> > +     BPF_THP_VM_NONE = 0,
> > +     BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > +     BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > +};
>
> I'm really not so sure how useful this is - can't a user just ascertain this
> from the VMA flags themselves?

I assume you are referring to checking flags from vma->vm_flags.
There is an exception where we cannot use vma->vm_flags: in
hugepage_madvise(), which calls khugepaged_enter_vma(vma, *vm_flags).

At this point, the VM_HUGEPAGE flag has not been set in vma->vm_flags
yet. Therefore, we must pass the separate *vm_flags variable.
Perhaps we can simplify the logic with the following change?

diff --git a/mm/madvise.c b/mm/madvise.c
index 35ed4ab0d7c5..5755de80a4d7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1425,6 +1425,8 @@ static int madvise_vma_behavior(struct
madvise_behavior *madv_behavior)
        VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);

        error = madvise_update_vma(new_flags, madv_behavior);
+       if (new_flags & VM_HUGEPAGE)
+               khugepaged_enter_vma(vma);
 out:
        /*
         * madvise() returns EAGAIN if kernel resources, such as

>
> Let's keep the interface as minimal as possible.
>
> > +
> > +/**
> > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>
> orders -> order?

ack

>
> > + * @vma: vm_area_struct associated with the THP allocation
> > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > + *            neither is set.
>
> Obv as above let's drop this probably :)
>
> > + * @tva_type: TVA type for current @vma
> > + * @orders: Bitmask of requested THP orders for this allocation
>
> Shouldn't requested = available?

ack

>
> > + *          - PMD-mapped allocation if PMD_ORDER is set
> > + *          - mTHP allocation otherwise
>
> Not sure these 2 points are super useful.

will remove it.

>
> > + *
> > + * Return: The suggested THP order from the BPF program for allocation. It will
> > + *         not exceed the highest requested order in @orders. Return -1 to
> > + *         indicate that the original requested @orders should remain unchanged.
> > + */
> > +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > +                        enum bpf_thp_vma_type vma_type,
> > +                        enum tva_type tva_type,
> > +                        unsigned long orders);
> > +
> > +struct bpf_thp_ops {
> > +     thp_order_fn_t __rcu *thp_get_order;
> > +};
> > +
> > +static struct bpf_thp_ops bpf_thp;
> > +static DEFINE_SPINLOCK(thp_ops_lock);
> > +
> > +/*
> > + * Returns the original @orders if no BPF program is attached or if the
> > + * suggested order is invalid.
> > + */
> > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > +                                   vm_flags_t vma_flags,
> > +                                   enum tva_type tva_type,
> > +                                   unsigned long orders)
> > +{
> > +     thp_order_fn_t *bpf_hook_thp_get_order;
> > +     unsigned long thp_orders = orders;
> > +     enum bpf_thp_vma_type vma_type;
> > +     int thp_order;
> > +
> > +     /* No BPF program is attached */
> > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > +                   &transparent_hugepage_flags))
> > +             return orders;
> > +
> > +     if (vma_flags & VM_HUGEPAGE)
> > +             vma_type = BPF_THP_VM_HUGEPAGE;
> > +     else if (vma_flags & VM_NOHUGEPAGE)
> > +             vma_type = BPF_THP_VM_NOHUGEPAGE;
> > +     else
> > +             vma_type = BPF_THP_VM_NONE;
>
> As per above, not sure this is all that useful.
>
> > +
> > +     rcu_read_lock();
> > +     bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > +     if (!bpf_hook_thp_get_order)
> > +             goto out;
> > +
> > +     thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
> > +     if (thp_order < 0)
> > +             goto out;
> > +     /*
> > +      * The maximum requested order is determined by the callsite. E.g.:
> > +      * - PMD-mapped THP uses PMD_ORDER
> > +      * - mTHP uses (PMD_ORDER - 1)
>
> I don't think this is quite right, highest_order() figures out the highest set
> bit, so mTHP can be PMD_ORDER - 1 or less (in theory ofc).
>
> I think we can just replace this with something simpler like - 'depending on
> where the BPF hook is invoked, we check for either PMD order or mTHP orders
> (less than PMD order)' or something.

ack

>
> > +      *
> > +      * We must respect this upper bound to avoid undefined behavior. So the
> > +      * highest suggested order can't exceed the highest requested order.
> > +      */
>
> I think this sentence is also unnecessary.

will remove it.

-- 
Regards
Yafang


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-12  8:28     ` Yafang Shao
@ 2025-09-12 11:53       ` Lorenzo Stoakes
  2025-09-14  2:22         ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 11:53 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Fri, Sep 12, 2025 at 04:28:46PM +0800, Yafang Shao wrote:
> On Thu, Sep 11, 2025 at 10:34 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 10:44:39AM +0800, Yafang Shao wrote:
> > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > > programs to influence THP order selection based on factors such as:
> > > - Workload identity
> > >   For example, workloads running in specific containers or cgroups.
> > > - Allocation context
> > >   Whether the allocation occurs during a page fault, khugepaged, swap or
> > >   other paths.
> > > - VMA's memory advice settings
> > >   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > > - Memory pressure
> > >   PSI system data or associated cgroup PSI metrics
> > >
> > > The kernel API of this new BPF hook is as follows,
> > >
> > > /**
> > >  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > >  * @vma: vm_area_struct associated with the THP allocation
> > >  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > >  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > >  *            neither is set.
> > >  * @tva_type: TVA type for current @vma
> > >  * @orders: Bitmask of requested THP orders for this allocation
> > >  *          - PMD-mapped allocation if PMD_ORDER is set
> > >  *          - mTHP allocation otherwise
> > >  *
> > >  * Return: The suggested THP order from the BPF program for allocation. It will
> > >  *         not exceed the highest requested order in @orders. Return -1 to
> > >  *         indicate that the original requested @orders should remain unchanged.
> > >  */
> > > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > >                          enum bpf_thp_vma_type vma_type,
> > >                          enum tva_type tva_type,
> > >                          unsigned long orders);
> > >
> > > Only a single BPF program can be attached at any given time, though it can
> > > be dynamically updated to adjust the policy. The implementation supports
> > > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > > file-backed THP.
> > >
> > > This functionality is only active when system-wide THP is configured to
> > > madvise or always mode. It remains disabled in never mode. Additionally,
> > > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > > functionality will also be unavailable for that task.
> > >
> > > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > > enabled. Note that this capability is currently unstable and may undergo
> > > significant changes—including potential removal—in future kernel versions.
> >
> > Thanks for highlighting.
> >
> > >
> > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > ---
> > >  MAINTAINERS             |   1 +
> > >  include/linux/huge_mm.h |  26 ++++-
> > >  mm/Kconfig              |  12 ++
> > >  mm/Makefile             |   1 +
> > >  mm/huge_memory_bpf.c    | 243 ++++++++++++++++++++++++++++++++++++++++
> > >  5 files changed, 280 insertions(+), 3 deletions(-)
> > >  create mode 100644 mm/huge_memory_bpf.c
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index 8fef05bc2224..d055a3c95300 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -16252,6 +16252,7 @@ F:    include/linux/huge_mm.h
> > >  F:   include/linux/khugepaged.h
> > >  F:   include/trace/events/huge_memory.h
> > >  F:   mm/huge_memory.c
> > > +F:   mm/huge_memory_bpf.c
> >
> > Thanks!
> >
> > >  F:   mm/khugepaged.c
> > >  F:   mm/mm_slot.h
> > >  F:   tools/testing/selftests/mm/khugepaged.c
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 23f124493c47..f72a5fd04e4f 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
> > >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> > >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> > >  };
> > >
> > >  struct kobject;
> > > @@ -270,6 +271,19 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> > >                                        enum tva_type type,
> > >                                        unsigned long orders);
> > >
> > > +#ifdef CONFIG_BPF_GET_THP_ORDER
> > > +unsigned long
> > > +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> > > +                     enum tva_type type, unsigned long orders);
> >
> > Thanks for renaming!
> >
> > > +#else
> > > +static inline unsigned long
> > > +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> > > +                     enum tva_type tva_flags, unsigned long orders)
> > > +{
> > > +     return orders;
> > > +}
> > > +#endif
> > > +
> > >  /**
> > >   * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
> > >   * @vma:  the vm area to check
> > > @@ -291,6 +305,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> > >                                      enum tva_type type,
> > >                                      unsigned long orders)
> > >  {
> > > +     unsigned long bpf_orders;
> > > +
> > > +     bpf_orders = bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
> > > +     if (!bpf_orders)
> > > +             return 0;
> >
> > I think it'd be easier to just do:
> >
> >         /* The BPF-specified order overrides which order is selected. */
> >         orders &= bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
> >         if (!orders)
> >                 return 0;
>
> good suggestion!

Thanks, though this does come back to 'are we masking on orders' or not.

Obviously this is predicated on that being the case.

> > >  struct thpsize {
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index d1ed839ca710..4d89d2158f10 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -896,6 +896,18 @@ config NO_PAGE_MAPCOUNT
> > >
> > >         EXPERIMENTAL because the impact of some changes is still unclear.
> > >
> > > +config BPF_GET_THP_ORDER
> >
> > Yeah, I think we maybe need to sledgehammer this as already Lance was confused
> > as to the permanence of this, and I feel that users might be too, even with the
> > '(EXPERIMENTAL)' bit.
> >
> > So maybe
> >
> > config BPF_GET_THP_ORDER_EXPERIMENTAL
> >
> > Just to hammer it home?
>
> ack

Thanks!

>
> >
> > > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > > +
> > > +     help
> > > +       Enable dynamic THP order selection using BPF programs. This
> > > +       experimental feature allows custom BPF logic to determine optimal
> > > +       transparent hugepage allocation sizes at runtime.
> > > +
> > > +       WARNING: This feature is unstable and may change in future kernel
> > > +       versions.
> > > +
> > >  endif # TRANSPARENT_HUGEPAGE
> > >
> > >  # simple helper to make the code a bit easier to read
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index 21abb3353550..f180332f2ad0 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> > >  obj-$(CONFIG_NUMA) += memory-tiers.o
> > >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > > +obj-$(CONFIG_BPF_GET_THP_ORDER) += huge_memory_bpf.o
> > >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> > >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > > new file mode 100644
> > > index 000000000000..525ee22ab598
> > > --- /dev/null
> > > +++ b/mm/huge_memory_bpf.c
> > > @@ -0,0 +1,243 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +/*
> > > + * BPF-based THP policy management
> > > + *
> > > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > > + */
> > > +
> > > +#include <linux/bpf.h>
> > > +#include <linux/btf.h>
> > > +#include <linux/huge_mm.h>
> > > +#include <linux/khugepaged.h>
> > > +
> > > +enum bpf_thp_vma_type {
> > > +     BPF_THP_VM_NONE = 0,
> > > +     BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > > +     BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > > +};
> >
> > I'm really not so sure how useful this is - can't a user just ascertain this
> > from the VMA flags themselves?
>
> I assume you are referring to checking flags from vma->vm_flags.
> There is an exception where we cannot use vma->vm_flags: in
> hugepage_madvise(), which calls khugepaged_enter_vma(vma, *vm_flags).
>
> At this point, the VM_HUGEPAGE flag has not been set in vma->vm_flags
> yet. Therefore, we must pass the separate *vm_flags variable.
> Perhaps we can simplify the logic with the following change?

Ugh god.

I guess this is the workaround for the vm_flags thing right.

>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 35ed4ab0d7c5..5755de80a4d7 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1425,6 +1425,8 @@ static int madvise_vma_behavior(struct
> madvise_behavior *madv_behavior)
>         VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
>
>         error = madvise_update_vma(new_flags, madv_behavior);
> +       if (new_flags & VM_HUGEPAGE)
> +               khugepaged_enter_vma(vma);

Hm ok, that's not such a bad idea, though ofc this should be something like:

	if (!error && (new_flags & VM_HUGEPAGE))
		khugepaged_enter_vma(vma);

And obviously dropping this khugepaged_enter_vma() from hugepage_madvise().

>  out:
>         /*
>          * madvise() returns EAGAIN if kernel resources, such as
>
> >
> > Let's keep the interface as minimal as possible.
> >
> > > +
> > > +/**
> > > + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >
> > orders -> order?
>
> ack

Thanks!

>
> >
> > > + * @vma: vm_area_struct associated with the THP allocation
> > > + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > > + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > > + *            neither is set.
> >
> > Obv as above let's drop this probably :)
> >
> > > + * @tva_type: TVA type for current @vma
> > > + * @orders: Bitmask of requested THP orders for this allocation
> >
> > Shouldn't requested = available?
>
> ack

Thanks!

>
> >
> > > + *          - PMD-mapped allocation if PMD_ORDER is set
> > > + *          - mTHP allocation otherwise
> >
> > Not sure these 2 points are super useful.
>
> will remove it.

Thanks!

>
> >
> > > + *
> > > + * Return: The suggested THP order from the BPF program for allocation. It will
> > > + *         not exceed the highest requested order in @orders. Return -1 to
> > > + *         indicate that the original requested @orders should remain unchanged.
> > > + */
> > > +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > > +                        enum bpf_thp_vma_type vma_type,
> > > +                        enum tva_type tva_type,
> > > +                        unsigned long orders);
> > > +
> > > +struct bpf_thp_ops {
> > > +     thp_order_fn_t __rcu *thp_get_order;
> > > +};
> > > +
> > > +static struct bpf_thp_ops bpf_thp;
> > > +static DEFINE_SPINLOCK(thp_ops_lock);
> > > +
> > > +/*
> > > + * Returns the original @orders if no BPF program is attached or if the
> > > + * suggested order is invalid.
> > > + */
> > > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > > +                                   vm_flags_t vma_flags,
> > > +                                   enum tva_type tva_type,
> > > +                                   unsigned long orders)
> > > +{
> > > +     thp_order_fn_t *bpf_hook_thp_get_order;
> > > +     unsigned long thp_orders = orders;
> > > +     enum bpf_thp_vma_type vma_type;
> > > +     int thp_order;
> > > +
> > > +     /* No BPF program is attached */
> > > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > > +                   &transparent_hugepage_flags))
> > > +             return orders;
> > > +
> > > +     if (vma_flags & VM_HUGEPAGE)
> > > +             vma_type = BPF_THP_VM_HUGEPAGE;
> > > +     else if (vma_flags & VM_NOHUGEPAGE)
> > > +             vma_type = BPF_THP_VM_NOHUGEPAGE;
> > > +     else
> > > +             vma_type = BPF_THP_VM_NONE;
> >
> > As per above, not sure this is all that useful.
> >
> > > +
> > > +     rcu_read_lock();
> > > +     bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > > +     if (!bpf_hook_thp_get_order)
> > > +             goto out;
> > > +
> > > +     thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
> > > +     if (thp_order < 0)
> > > +             goto out;
> > > +     /*
> > > +      * The maximum requested order is determined by the callsite. E.g.:
> > > +      * - PMD-mapped THP uses PMD_ORDER
> > > +      * - mTHP uses (PMD_ORDER - 1)
> >
> > I don't think this is quite right, highest_order() figures out the highest set
> > bit, so mTHP can be PMD_ORDER - 1 or less (in theory ofc).
> >
> > I think we can just replace this with something simpler like - 'depending on
> > where the BPF hook is invoked, we check for either PMD order or mTHP orders
> > (less than PMD order)' or something.
>
> ack

Thanks!

>
> >
> > > +      *
> > > +      * We must respect this upper bound to avoid undefined behavior. So the
> > > +      * highest suggested order can't exceed the highest requested order.
> > > +      */
> >
> > I think this sentence is also unnecessary.
>
> will remove it.

Thanks!

>
> --
> Regards
> Yafang

Cheers, Lorenzo



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-12  8:03     ` Yafang Shao
@ 2025-09-12 12:00       ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 12:00 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Fri, Sep 12, 2025 at 04:03:32PM +0800, Yafang Shao wrote:
> On Thu, Sep 11, 2025 at 10:51 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 10:44:39AM +0800, Yafang Shao wrote:
> > > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > > new file mode 100644
> > > index 000000000000..525ee22ab598
> > > --- /dev/null
> > > +++ b/mm/huge_memory_bpf.c
> >
> > [snip]
> >
> > > +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> > > +                                   vm_flags_t vma_flags,
> > > +                                   enum tva_type tva_type,
> > > +                                   unsigned long orders)
> > > +{
> > > +     thp_order_fn_t *bpf_hook_thp_get_order;
> > > +     unsigned long thp_orders = orders;
> > > +     enum bpf_thp_vma_type vma_type;
> > > +     int thp_order;
> > > +
> > > +     /* No BPF program is attached */
> > > +     if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> > > +                   &transparent_hugepage_flags))
> > > +             return orders;
> > > +
> > > +     if (vma_flags & VM_HUGEPAGE)
> > > +             vma_type = BPF_THP_VM_HUGEPAGE;
> > > +     else if (vma_flags & VM_NOHUGEPAGE)
> > > +             vma_type = BPF_THP_VM_NOHUGEPAGE;
> > > +     else
> > > +             vma_type = BPF_THP_VM_NONE;
> > > +
> > > +     rcu_read_lock();
> > > +     bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> > > +     if (!bpf_hook_thp_get_order)
> > > +             goto out;
> > > +
> > > +     thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
> > > +     if (thp_order < 0)
> > > +             goto out;
> > > +     /*
> > > +      * The maximum requested order is determined by the callsite. E.g.:
> > > +      * - PMD-mapped THP uses PMD_ORDER
> > > +      * - mTHP uses (PMD_ORDER - 1)
> > > +      *
> > > +      * We must respect this upper bound to avoid undefined behavior. So the
> > > +      * highest suggested order can't exceed the highest requested order.
> > > +      */
> > > +     if (thp_order <= highest_order(orders))
> > > +             thp_orders = BIT(thp_order);
> >
> > OK so looking at Lance's reply re: setting 0 and what we're doing here in
> > general - this seems a bit weird to me.
> >
> > Shouldn't orders be specifying a _mask_ as to which orders are _available_,
> > rather than allowing a user to specify an arbitrary order?
> >
> > So if you're in a position where the only possible order is PMD-sized, now this
> > would let you arbitrarily select an mTHP right? That does not seem correct.
> >
> > And as per Lance, if we cannot satisfy the requested order, we shouldn't fall
> > back to available orders, we should take that as a signal that we cannot have
> > THP at all.
> >
> > So shouldn't this just be:
> >
> >         thp_orders = orders & BIT(thp_order);
>
> That's better.  I will change it.

Great I agree (obviously :P)

Thanks!

>
> --
> Regards
> Yafang

Cheers, Lorenzo



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-12  7:58           ` Yafang Shao
@ 2025-09-12 12:04             ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 12:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Lance Yang, akpm, david, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, hannes, usamaarif642, gutierrez.asier,
	willy, ast, daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Fri, Sep 12, 2025 at 03:58:28PM +0800, Yafang Shao wrote:
> > > >
> > > > Perhaps we need to put an 'EXPERIMENTAL_' prefix on the config flag too to really
> > > > bring this home, as it's perhaps not all that clear :)
> > >
> > > No need for a 'EXPERIMENTAL_' prefix, it was just me missing
> > > the background. Appreciate you clarifying this!
> >
> > Don't worry about it, but also it suggests that we probably need to be
> > ultra-super clear to users in general. So I think an _EXPERIMENTAL suffix is
> > probably pretty valid here just to _hammer home_ that - hey - we might break
> > you! :)
>
> I will add it. Thanks for the reminder.

Thanks!

>
> --
> Regards
> Yafang

Cheers, Lorenzo



* Re: [PATCH v7 mm-new 03/10] mm: thp: decouple THP allocation between swap and page fault paths
  2025-09-12  7:20     ` Yafang Shao
@ 2025-09-12 12:04       ` Lorenzo Stoakes
  0 siblings, 0 replies; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 12:04 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Fri, Sep 12, 2025 at 03:20:38PM +0800, Yafang Shao wrote:
> On Thu, Sep 11, 2025 at 10:56 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 10:44:40AM +0800, Yafang Shao wrote:
> > > The new BPF capability enables finer-grained THP policy decisions by
> > > introducing separate handling for swap faults versus normal page faults.
> > >
> > > As highlighted by Barry:
> > >
> > >   We’ve observed that swapping in large folios can lead to more
> > >   swap thrashing for some workloads- e.g. kernel build. Consequently,
> > >   some workloads might prefer swapping in smaller folios than those
> > >   allocated by alloc_anon_folio().
> > >
> > > While prctl() could potentially be extended to leverage this new policy,
> > > doing so would require modifications to the uAPI.
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> >
> > Other than nits, these seems fine, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >
> > > Cc: Barry Song <21cnbao@gmail.com>
> > > ---
> > >  include/linux/huge_mm.h | 3 ++-
> > >  mm/huge_memory.c        | 2 +-
> > >  mm/memory.c             | 2 +-
> > >  3 files changed, 4 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index f72a5fd04e4f..b9742453806f 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -97,9 +97,10 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> > >
> > >  enum tva_type {
> > >       TVA_SMAPS,              /* Exposing "THPeligible:" in smaps. */
> > > -     TVA_PAGEFAULT,          /* Serving a page fault. */
> > > +     TVA_PAGEFAULT,          /* Serving a non-swap page fault. */
> > >       TVA_KHUGEPAGED,         /* Khugepaged collapse. */
> > >       TVA_FORCED_COLLAPSE,    /* Forced collapse (e.g. MADV_COLLAPSE). */
> > > +     TVA_SWAP,               /* Serving a swap */
> >
> > Serving a swap what? :) I think TVA_SWAP_PAGEFAULT would be better here right?
> > And 'serving a swap page fault'.
>
> will change it. Thanks for your suggestion.

Thanks!

>
> --
> Regards
> Yafang

Cheers, Lorenzo



* Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged
  2025-09-12  6:17     ` Yafang Shao
@ 2025-09-12 13:48       ` Lorenzo Stoakes
  2025-09-14  2:19         ` Yafang Shao
  0 siblings, 1 reply; 61+ messages in thread
From: Lorenzo Stoakes @ 2025-09-12 13:48 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Fri, Sep 12, 2025 at 02:17:01PM +0800, Yafang Shao wrote:
> On Thu, Sep 11, 2025 at 11:58 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 10:44:41AM +0800, Yafang Shao wrote:
> > > Currently, THP allocation cannot be restricted to khugepaged alone while
> > > being disabled in the page fault path. This limitation exists because
> > > disabling THP allocation during page faults also prevents the execution of
> > > khugepaged_enter_vma() in that path.
> >
> > This is quite confusing, I see what you mean - you want to be able to disable
> > page fault THP but not khugepaged THP _at the point of possibly faulting in a
> > THP aligned VMA_.
> >
> > It seems this patch makes khugepaged_enter_vma() unconditional for an anonymous
> > VMA, rather than depending on the return value specified by
> > thp_vma_allowable_order().
>
> The functions thp_vma_allowable_order(TVA_PAGEFAULT) and
> thp_vma_allowable_order(TVA_KHUGEPAGED) are functionally equivalent
> within the page fault handler; they always yield the same result.
> Consequently, their execution order is irrelevant.

It seems hard to definitively demonstrate that by checking !in_pf vs not in this
situation :) but it seems broadly true afaict.

So they differ only in that one starts khugepaged, the other tries to
establish a THP on fault via create_huge_pmd().

>
> The change reorders these two calls and, in doing so, also moves the
> call to vmf_anon_prepare(vmf). This alters the control flow:
> - before this change:  The logic checked the return value of
> vmf_anon_prepare() between the two thp_vma_allowable_order() calls.
>
>     thp_vma_allowable_order(TVA_PAGEFAULT);
>     ret = vmf_anon_prepare(vmf);
>     if (ret)
>         return ret;
>     thp_vma_allowable_order(TVA_KHUGEPAGED);

I mean it's also _only if_ the TVA_PAGEFAULT invocation succeeds that the
TVA_KHUGEPAGED one happens.

>
>  - after this change: The logic now executes both
> thp_vma_allowable_order() calls first and does not check the return
> value of vmf_anon_prepare().
>
>     thp_vma_allowable_order(TVA_KHUGEPAGED);
>     thp_vma_allowable_order(TVA_PAGEFAULT);
>     ret = vmf_anon_prepare(vmf); // Return value 'ret' is ignored.

Hm this is confusing, your code does:

+       if (pmd_none(*vmf.pmd)) {
+               if (vma_is_anonymous(vma))
+                       khugepaged_enter_vma(vma, vm_flags);
+               if (thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
+                       ret = create_huge_pmd(&vmf);
+                       if (!(ret & VM_FAULT_FALLBACK))
+                               return ret;
+               }

So the ret is absolutely not ignored, but whether it succeeds or not, we still
invoke khugepaged_enter_vma().

Previously we would not have done this had vmf_anon_prepare() failed in
do_huge_pmd_anonymous_page().

Which I guess is what you mean?

>
> This change is safe because the return value of vmf_anon_prepare() can
> be safely ignored. This function checks for transient system-level
> conditions (e.g., memory pressure, THP availability) that might
> prevent an immediate THP allocation. It does not guarantee that a
> subsequent allocation will succeed.
>
> This behavior is consistent with the policy in hugepage_madvise(),
> where a VMA is queued for khugepaged before a definitive allocation
> check. If the system is under pressure, khugepaged will simply retry
> the allocation at a more opportune time.

OK. I do note though that the khugepaged being kicked off is at mm_struct level.

So us trying to invoke khugepaged on the mm again is about... something having
changed that would previously have prevented us but now doesn't?

That is, a product of thp_vma_allowable_order() right?

So probably a sysfs change or similar?

But I guess it makes sense to hook in BPF whenever this is the case because this
_could_ be the point at which khugepaged enters the mm, and we want to select
the allowable order at this time.

So on basis of the two checks being effectively equivalent (on assumption this
is always the case) then the change is fairly reasonable.

Though I would put this information, that the checks are equivalent, in the
commit message so it's really clear.

>
> >
> > So I think a clearer explanation is:
> >
> >         khugepaged_enter_vma() ultimately invokes any attached BPF function with
> >         the TVA_KHUGEPAGED flag set when determining whether or not to enable
> >         khugepaged THP for a freshly faulted in VMA.
> >
> >         Currently, on fault, we invoke this in do_huge_pmd_anonymous_page(), as
> >         invoked by create_huge_pmd() and only when we have already checked to
> >         see if an allowable TVA_PAGEFAULT order is specified.
> >
> >         Since we might want to disallow THP on fault-in but allow it via
> >         khugepaged, we move things around so we always attempt to enter
> >         khugepaged upon fault.
>
> Thanks for the clarification.

Thanks

>
> >
> > Having said all this, I'm very confused.
> >
> > Why are we doing this?
> >
> > We only enable khugepaged _early_ when we know we're faulting in a huge PMD
> > here.
> >
> > I guess we do this because, if we are allowed to do the pagefault, maybe
> > something changed that might previously have prevented khugepaged from running
> > for the mm.
> >
> > But now we're just checking unconditionally for... no reason?
>
> I have blamed the change history of do_huge_pmd_anonymous_page() but
> was unable to find any rationale for placing khugepaged_enter_vma()
> after the vmf_anon_prepare() check. I therefore believe this ordering
> is likely unintentional.

Right, yeah.

>
> >
> > if BPF disables page fault but not khugepaged, then surely the mm would already
> > be under khugepaged if it could be?
>
> The behavior you describe applies to the madvise mode, not the always
> mode. To reiterate: the hugepage_madvise() function unconditionally
> adds the mm to the khugepaged queue, whereas the page fault
> handler employs conditional logic.

Right, so the ordering differs between the two paths.

>
> >
> > It's sort of immaterial if we get a pmd_none() that is not-faultable for
> > whatever reason but BPF might say is khugepaged'able, because it'd have already
> > set this.
> >
> > This is because if we just map a new VMA, we already let khugepaged have it via
> > khugepaged_enter_vma() in __mmap_new_vma() and in the merge paths.
> >
> > I mean maybe I'm missing something here :)
> >
> > >
> > > With the introduction of BPF, we can now implement THP policies based on
> > > different TVA types. This patch adjusts the logic to support this new
> > > capability.
> > >
> > > While we could also extend prtcl() to utilize this new policy, such a
> >
> > Typo: prtcl -> prctl
>
> thanks

Cheers

>
> >
> > > change would require a uAPI modification.
> >
> > Hm, in what respect? PR_SET_THP_DISABLE?
>
> Right, we can extend PR_SET_THP_DISABLE() to support this logic as well.

Yeah let's not touch that please :)

>
> --
> Regards
> Yafang

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged
  2025-09-12 13:48       ` Lorenzo Stoakes
@ 2025-09-14  2:19         ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-14  2:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Fri, Sep 12, 2025 at 9:48 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Sep 12, 2025 at 02:17:01PM +0800, Yafang Shao wrote:
> > On Thu, Sep 11, 2025 at 11:58 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, Sep 10, 2025 at 10:44:41AM +0800, Yafang Shao wrote:
> > > > Currently, THP allocation cannot be restricted to khugepaged alone while
> > > > being disabled in the page fault path. This limitation exists because
> > > > disabling THP allocation during page faults also prevents the execution of
> > > > khugepaged_enter_vma() in that path.
> > >
> > > This is quite confusing, I see what you mean - you want to be able to disable
> > > page fault THP but not khugepaged THP _at the point of possibly faulting in a
> > > THP aligned VMA_.
> > >
> > > It seems this patch makes khugepaged_enter_vma() unconditional for an anonymous
> > > VMA, rather than depending on the return value specified by
> > > thp_vma_allowable_order().
> >
> > The functions thp_vma_allowable_order(TVA_PAGEFAULT) and
> > thp_vma_allowable_order(TVA_KHUGEPAGED) are functionally equivalent
> > within the page fault handler; they always yield the same result.
> > Consequently, their execution order is irrelevant.
>
> It seems hard to definitively demonstrate that by checking !in_pf vs not in this
> situation :) but it seems broadly true afaict.
>
> So they differ only in that one starts khugepaged, the other tries to
> establish a THP on fault via create_huge_pmd().

right

>
> >
> > The change reorders these two calls and, in doing so, also moves the
> > call to vmf_anon_prepare(vmf). This alters the control flow:
> > - before this change:  The logic checked the return value of
> > vmf_anon_prepare() between the two thp_vma_allowable_order() calls.
> >
> >     thp_vma_allowable_order(TVA_PAGEFAULT);
> >     ret = vmf_anon_prepare(vmf);
> >     if (ret)
> >         return ret;
> >     thp_vma_allowable_order(TVA_KHUGEPAGED);
>
> I mean it's also _only if_ the TVA_PAGEFAULT invocation succeeds that the
> TVA_KHUGEPAGED one happens.
>
> >
> >  - after this change: The logic now executes both
> > thp_vma_allowable_order() calls first and does not check the return
> > value of vmf_anon_prepare().
> >
> >     thp_vma_allowable_order(TVA_KHUGEPAGED);
> >     thp_vma_allowable_order(TVA_PAGEFAULT);
> >     ret = vmf_anon_prepare(vmf); // Return value 'ret' is ignored.
>
> Hm this is confusing, your code does:
>
> +       if (pmd_none(*vmf.pmd)) {
> +               if (vma_is_anonymous(vma))
> +                       khugepaged_enter_vma(vma, vm_flags);
> +               if (thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
> +                       ret = create_huge_pmd(&vmf);
> +                       if (!(ret & VM_FAULT_FALLBACK))
> +                               return ret;
> +               }
>
> So the ret is absolutely not ignored, but whether it succeeds or not, we still
> invoke khugepaged_enter_vma().
>
> Previously we would not have done this had vmf_anon_prepare() failed in
> do_huge_pmd_anonymous_page().
>
> Which I guess is what you mean?
>
> >
> > This change is safe because the return value of vmf_anon_prepare() can
> > be safely ignored. This function checks for transient system-level
> > conditions (e.g., memory pressure, THP availability) that might
> > prevent an immediate THP allocation. It does not guarantee that a
> > subsequent allocation will succeed.
> >
> > This behavior is consistent with the policy in hugepage_madvise(),
> > where a VMA is queued for khugepaged before a definitive allocation
> > check. If the system is under pressure, khugepaged will simply retry
> > the allocation at a more opportune time.
>
> OK. I do note though that the khugepaged being kicked off is at mm_struct level.

The unit of operation for khugepaged is the mm_struct itself. It
processes the entire mm even when only a single VMA within it is a
candidate for a THP.

>
> So us trying to invoke khugepaged on the mm again is about.. something having
> changed that would previously have prevented us but now doesn't?
>
> That is, a product of thp_vma_allowable_order() right?
>
> So probably a sysfs change or similar?
>
> But I guess it makes sense to hook in BPF whenever this is the case because this
> _could_ be the point at which khugepaged enters the mm, and we want to select
> the allowable order at this time.
>
> So on basis of the two checks being effectively equivalent (on assumption this
> is always the case) then the change is fairly reasonable.

Yes, that is exactly what I mean.

>
> Though I would put this information, that the checks are equivalent, in the
> commit message so it's really clear.

will add it.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-12 11:53       ` Lorenzo Stoakes
@ 2025-09-14  2:22         ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-14  2:22 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc

On Fri, Sep 12, 2025 at 7:53 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Sep 12, 2025 at 04:28:46PM +0800, Yafang Shao wrote:
> > On Thu, Sep 11, 2025 at 10:34 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, Sep 10, 2025 at 10:44:39AM +0800, Yafang Shao wrote:
> > > > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > > > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > > > programs to influence THP order selection based on factors such as:
> > > > - Workload identity
> > > >   For example, workloads running in specific containers or cgroups.
> > > > - Allocation context
> > > >   Whether the allocation occurs during a page fault, khugepaged, swap or
> > > >   other paths.
> > > > - VMA's memory advice settings
> > > >   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > > > - Memory pressure
> > > >   PSI system data or associated cgroup PSI metrics
> > > >
> > > > The kernel API of this new BPF hook is as follows,
> > > >
> > > > /**
> > > >  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> > > >  * @vma: vm_area_struct associated with the THP allocation
> > > >  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set
> > > >  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> > > >  *            neither is set.
> > > >  * @tva_type: TVA type for current @vma
> > > >  * @orders: Bitmask of requested THP orders for this allocation
> > > >  *          - PMD-mapped allocation if PMD_ORDER is set
> > > >  *          - mTHP allocation otherwise
> > > >  *
> > > >  * Return: The suggested THP order from the BPF program for allocation. It will
> > > >  *         not exceed the highest requested order in @orders. Return -1 to
> > > >  *         indicate that the original requested @orders should remain unchanged.
> > > >  */
> > > > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> > > >                          enum bpf_thp_vma_type vma_type,
> > > >                          enum tva_type tva_type,
> > > >                          unsigned long orders);
> > > >
> > > > Only a single BPF program can be attached at any given time, though it can
> > > > be dynamically updated to adjust the policy. The implementation supports
> > > > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > > > file-backed THP.
> > > >
> > > > This functionality is only active when system-wide THP is configured to
> > > > madvise or always mode. It remains disabled in never mode. Additionally,
> > > > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > > > functionality will also be unavailable for that task.
> > > >
> > > > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > > > enabled. Note that this capability is currently unstable and may undergo
> > > > significant changes—including potential removal—in future kernel versions.
> > >
> > > Thanks for highlighting.
> > >
> > > >
> > > > Suggested-by: David Hildenbrand <david@redhat.com>
> > > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > > ---
> > > >  MAINTAINERS             |   1 +
> > > >  include/linux/huge_mm.h |  26 ++++-
> > > >  mm/Kconfig              |  12 ++
> > > >  mm/Makefile             |   1 +
> > > >  mm/huge_memory_bpf.c    | 243 ++++++++++++++++++++++++++++++++++++++++
> > > >  5 files changed, 280 insertions(+), 3 deletions(-)
> > > >  create mode 100644 mm/huge_memory_bpf.c
> > > >
> > > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > > index 8fef05bc2224..d055a3c95300 100644
> > > > --- a/MAINTAINERS
> > > > +++ b/MAINTAINERS
> > > > @@ -16252,6 +16252,7 @@ F:    include/linux/huge_mm.h
> > > >  F:   include/linux/khugepaged.h
> > > >  F:   include/trace/events/huge_memory.h
> > > >  F:   mm/huge_memory.c
> > > > +F:   mm/huge_memory_bpf.c
> > >
> > > Thanks!
> > >
> > > >  F:   mm/khugepaged.c
> > > >  F:   mm/mm_slot.h
> > > >  F:   tools/testing/selftests/mm/khugepaged.c
> > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > index 23f124493c47..f72a5fd04e4f 100644
> > > > --- a/include/linux/huge_mm.h
> > > > +++ b/include/linux/huge_mm.h
> > > > @@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
> > > >       TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> > > >       TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > > >       TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > > > +     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
> > > >  };
> > > >
> > > >  struct kobject;
> > > > @@ -270,6 +271,19 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
> > > >                                        enum tva_type type,
> > > >                                        unsigned long orders);
> > > >
> > > > +#ifdef CONFIG_BPF_GET_THP_ORDER
> > > > +unsigned long
> > > > +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> > > > +                     enum tva_type type, unsigned long orders);
> > >
> > > Thanks for renaming!
> > >
> > > > +#else
> > > > +static inline unsigned long
> > > > +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> > > > +                     enum tva_type tva_flags, unsigned long orders)
> > > > +{
> > > > +     return orders;
> > > > +}
> > > > +#endif
> > > > +
> > > >  /**
> > > >   * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
> > > >   * @vma:  the vm area to check
> > > > @@ -291,6 +305,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> > > >                                      enum tva_type type,
> > > >                                      unsigned long orders)
> > > >  {
> > > > +     unsigned long bpf_orders;
> > > > +
> > > > +     bpf_orders = bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
> > > > +     if (!bpf_orders)
> > > > +             return 0;
> > >
> > > I think it'd be easier to just do:
> > >
> > >         /* The BPF-specified order overrides which order is selected. */
> > >         orders &= bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
> > >         if (!orders)
> > >                 return 0;
> >
> > good suggestion!
>
> Thanks, though this does come back to 'are we masking on orders' or not.
>
> Obviously this is predicated on that being the case.
>
> > > >  struct thpsize {
> > > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > > index d1ed839ca710..4d89d2158f10 100644
> > > > --- a/mm/Kconfig
> > > > +++ b/mm/Kconfig
> > > > @@ -896,6 +896,18 @@ config NO_PAGE_MAPCOUNT
> > > >
> > > >         EXPERIMENTAL because the impact of some changes is still unclear.
> > > >
> > > > +config BPF_GET_THP_ORDER
> > >
> > > Yeah, I think we maybe need to sledgehammer this as already Lance was confused
> > > as to the permenancy of this, and I feel that users might be too, even with the
> > > '(EXPERIMENTAL)' bit.
> > >
> > > So maybe
> > >
> > > config BPF_GET_THP_ORDER_EXPERIMENTAL
> > >
> > > Just to hammer it home?
> >
> > ack
>
> Thanks!
>
> >
> > >
> > > > +     bool "BPF-based THP order selection (EXPERIMENTAL)"
> > > > +     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> > > > +
> > > > +     help
> > > > +       Enable dynamic THP order selection using BPF programs. This
> > > > +       experimental feature allows custom BPF logic to determine optimal
> > > > +       transparent hugepage allocation sizes at runtime.
> > > > +
> > > > +       WARNING: This feature is unstable and may change in future kernel
> > > > +       versions.
> > > > +
> > > >  endif # TRANSPARENT_HUGEPAGE
> > > >
> > > >  # simple helper to make the code a bit easier to read
> > > > diff --git a/mm/Makefile b/mm/Makefile
> > > > index 21abb3353550..f180332f2ad0 100644
> > > > --- a/mm/Makefile
> > > > +++ b/mm/Makefile
> > > > @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
> > > >  obj-$(CONFIG_NUMA) += memory-tiers.o
> > > >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > > >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > > > +obj-$(CONFIG_BPF_GET_THP_ORDER) += huge_memory_bpf.o
> > > >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > > >  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> > > >  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> > > > diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> > > > new file mode 100644
> > > > index 000000000000..525ee22ab598
> > > > --- /dev/null
> > > > +++ b/mm/huge_memory_bpf.c
> > > > @@ -0,0 +1,243 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + * BPF-based THP policy management
> > > > + *
> > > > + * Author: Yafang Shao <laoar.shao@gmail.com>
> > > > + */
> > > > +
> > > > +#include <linux/bpf.h>
> > > > +#include <linux/btf.h>
> > > > +#include <linux/huge_mm.h>
> > > > +#include <linux/khugepaged.h>
> > > > +
> > > > +enum bpf_thp_vma_type {
> > > > +     BPF_THP_VM_NONE = 0,
> > > > +     BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> > > > +     BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> > > > +};
> > >
> > > I'm really not so sure how useful this is - can't a user just ascertain this
> > > from the VMA flags themselves?
> >
> > I assume you are referring to checking flags from vma->vm_flags.
> > There is an exception where we cannot use vma->vm_flags: in
> > hugepage_madvise(), which calls khugepaged_enter_vma(vma, *vm_flags).
> >
> > At this point, the VM_HUGEPAGE flag has not been set in vma->vm_flags
> > yet. Therefore, we must pass the separate *vm_flags variable.
> > Perhaps we can simplify the logic with the following change?
>
> Ugh god.
>
> I guess this is the workaround for the vm_flags thing right.
>
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 35ed4ab0d7c5..5755de80a4d7 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1425,6 +1425,8 @@ static int madvise_vma_behavior(struct
> > madvise_behavior *madv_behavior)
> >         VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);
> >
> >         error = madvise_update_vma(new_flags, madv_behavior);
> > +       if (new_flags & VM_HUGEPAGE)
> > +               khugepaged_enter_vma(vma);
>
> Hm ok, that's not such a bad idea, though ofc this should be something like:
>
>         if (!error && (new_flags & VM_HUGEPAGE))
>                 khugepaged_enter_vma(vma);

ack

>
> And obviously dropping this khugepaged_enter_vma() from hugepage_madvise().

Thanks for the reminder.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-11 13:43   ` Lorenzo Stoakes
@ 2025-09-14  2:47     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-14  2:47 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
	dev.jain, hannes, usamaarif642, gutierrez.asier, willy, ast,
	daniel, andrii, ameryhung, rientjes, corbet, 21cnbao,
	shakeel.butt, bpf, linux-mm, linux-doc, Lance Yang

On Thu, Sep 11, 2025 at 9:43 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Sep 10, 2025 at 10:44:38AM +0800, Yafang Shao wrote:
> > Since a task with MMF_DISABLE_THP_COMPLETELY cannot use THP, remove it from
> > the khugepaged_mm_slot to stop khugepaged from processing it.
> >
> > After this change, the following semantic relationship always holds:
> >
> >   MMF_VM_HUGEPAGE is set     == task is in khugepaged mm_slot
> >   MMF_VM_HUGEPAGE is not set == task is not in khugepaged mm_slot
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > Cc: Lance Yang <ioworker0@gmail.com>
>
> (Obviously on basis of fixing issue bot reported).
>
> > ---
> >  include/linux/khugepaged.h |  1 +
> >  kernel/sys.c               |  6 ++++++
> >  mm/khugepaged.c            | 19 +++++++++----------
> >  3 files changed, 16 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index eb1946a70cff..6cb9107f1006 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -19,6 +19,7 @@ extern void khugepaged_min_free_kbytes_update(void);
> >  extern bool current_is_khugepaged(void);
> >  extern int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> >                                  bool install_pmd);
> > +bool hugepage_pmd_enabled(void);
>
> Need to provide a !CONFIG_TRANSPARENT_HUGEPAGE version, or avoid invoking
> this in a context where CONFIG_TRANSPARENT_HUGEPAGE is not set.
>
> >
> >  static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> >  {
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index a46d9b75880b..a1c1e8007f2d 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -8,6 +8,7 @@
> >  #include <linux/export.h>
> >  #include <linux/mm.h>
> >  #include <linux/mm_inline.h>
> > +#include <linux/khugepaged.h>
> >  #include <linux/utsname.h>
> >  #include <linux/mman.h>
> >  #include <linux/reboot.h>
> > @@ -2493,6 +2494,11 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
> >               mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
> >               mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> >       }
> > +
> > +     if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
> > +         !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> > +         hugepage_pmd_enabled())
> > +             __khugepaged_enter(mm);
>
> Let's refactor this so it's not open-coded.
>
> We can have:
>
> void khugepaged_enter_mm(struct mm_struct *mm)
> {
>         if (mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm))
>                 return;
>         if (mm_flags_test(MMF_VM_HUGEPAGE, mm))
>                 return;
>         if (!hugepage_pmd_enabled())
>                 return;
>
>         __khugepaged_enter(mm);
> }
>
> void khugepaged_enter_vma(struct vm_area_struct *vma,
>                           vm_flags_t vm_flags)
> {
>         if (!thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
>                 return;
>
>         khugepaged_enter_mm(vma->vm_mm);
> }
>
> Then just invoke khugepaged_enter_mm() here.

That's better.

>
>
> >       mmap_write_unlock(current->mm);
> >       return 0;
> >  }
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 4ec324a4c1fe..88ac482fb3a0 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -413,7 +413,7 @@ static inline int hpage_collapse_test_exit_or_disable(struct mm_struct *mm)
> >               mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm);
> >  }
> >
> > -static bool hugepage_pmd_enabled(void)
> > +bool hugepage_pmd_enabled(void)
> >  {
> >       /*
> >        * We cover the anon, shmem and the file-backed case here; file-backed
> > @@ -445,6 +445,7 @@ void __khugepaged_enter(struct mm_struct *mm)
> >
> >       /* __khugepaged_exit() must not run from under us */
> >       VM_BUG_ON_MM(hpage_collapse_test_exit(mm), mm);
> > +     WARN_ON_ONCE(mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm));
>
> Not sure why this needs to be a naked WARN_ON_ONCE()? Seems that'd be a
> programmatic error, so VM_WARN_ON_ONCE() more appropriate?

ack

>
> Can also change the VM_BUG_ON_MM() to VM_WARN_ON_ONCE_MM() while we're here.

ack

>
> >       if (unlikely(mm_flags_test_and_set(MMF_VM_HUGEPAGE, mm)))
> >               return;
> >
> > @@ -472,7 +473,8 @@ void __khugepaged_enter(struct mm_struct *mm)
> >  void khugepaged_enter_vma(struct vm_area_struct *vma,
> >                         vm_flags_t vm_flags)
> >  {
> > -     if (!mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> > +     if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, vma->vm_mm) &&
> > +         !mm_flags_test(MMF_VM_HUGEPAGE, vma->vm_mm) &&
> >           hugepage_pmd_enabled()) {
> >               if (thp_vma_allowable_order(vma, vm_flags, TVA_KHUGEPAGED, PMD_ORDER))
> >                       __khugepaged_enter(vma->vm_mm);
>
> See above, we can refactor this.

ack

>
> > @@ -1451,16 +1453,13 @@ static void collect_mm_slot(struct khugepaged_mm_slot *mm_slot)
> >
> >       lockdep_assert_held(&khugepaged_mm_lock);
> >
> > -     if (hpage_collapse_test_exit(mm)) {
> > +     if (hpage_collapse_test_exit_or_disable(mm)) {
> >               /* free mm_slot */
> >               hash_del(&slot->hash);
> >               list_del(&slot->mm_node);
> >
> > -             /*
> > -              * Not strictly needed because the mm exited already.
> > -              *
> > -              * mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> > -              */
> > +             /* If the mm is disabled, this flag must be cleared. */
> > +             mm_flags_clear(MMF_VM_HUGEPAGE, mm);
> >
> >               /* khugepaged_mm_lock actually not necessary for the below */
> >               mm_slot_free(mm_slot_cache, mm_slot);
> > @@ -2507,9 +2506,9 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
> >       VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
> >       /*
> >        * Release the current mm_slot if this mm is about to die, or
> > -      * if we scanned all vmas of this mm.
> > +      * if we scanned all vmas of this mm, or if this mm is disabled.
> >        */
> > -     if (hpage_collapse_test_exit(mm) || !vma) {
> > +     if (hpage_collapse_test_exit_or_disable(mm) || !vma) {
> >               /*
> >                * Make sure that if mm_users is reaching zero while
> >                * khugepaged runs here, khugepaged_exit will find
>
> Seems reasonable, but makes me wonder if we actually always want to invoke
> hpage_collapse_test_exit_or_disable()?
>
> I guess the VM_BUG_ON() (though it should be a VM_WARN_ON_ONCE()) in
> __khugepaged_enter() is a legit use, but the only other case is
> retract_page_tables().
>
> I wonder if we should change this also? Seems reasonable to.

Right, we can change it also. Thanks for your suggestion.


-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
  2025-09-11 13:47         ` Lorenzo Stoakes
@ 2025-09-14  2:48           ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-14  2:48 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Zi Yan, Lance Yang, llvm, oe-kbuild-all, bpf, linux-mm, linux-doc,
	Lance Yang, akpm, gutierrez.asier, rientjes, andrii, david,
	baolin.wang, Liam.Howlett, ameryhung, ryan.roberts, usamaarif642,
	willy, corbet, npache, dev.jain, 21cnbao, shakeel.butt, ast,
	daniel, hannes, kernel test robot

On Thu, Sep 11, 2025 at 9:47 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Sep 10, 2025 at 10:28:30PM -0400, Zi Yan wrote:
> > On 10 Sep 2025, at 22:12, Lance Yang wrote:
> >
> > > Hi Yafang,
> > >
> > > On 2025/9/11 01:27, kernel test robot wrote:
> > >> Hi Yafang,
> > >>
> > >> kernel test robot noticed the following build errors:
> > >>
> > >> [auto build test ERROR on akpm-mm/mm-everything]
> > >>
> > >> url:    https://github.com/intel-lab-lkp/linux/commits/Yafang-Shao/mm-thp-remove-disabled-task-from-khugepaged_mm_slot/20250910-144850
> > >> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> > >> patch link:    https://lore.kernel.org/r/20250910024447.64788-2-laoar.shao%40gmail.com
> > >> patch subject: [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot
> > >> config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/config)
> > >> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> > >> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250911/202509110109.PSgSHb31-lkp@intel.com/reproduce)
> > >>
> > >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > >> the same patch/commit), kindly add following tags
> > >> | Reported-by: kernel test robot <lkp@intel.com>
> > >> | Closes: https://lore.kernel.org/oe-kbuild-all/202509110109.PSgSHb31-lkp@intel.com/
> > >>
> > >> All errors (new ones prefixed by >>):
> > >>
> > >>>> kernel/sys.c:2500:6: error: call to undeclared function 'hugepage_pmd_enabled'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> > >>      2500 |             hugepage_pmd_enabled())
> > >>           |             ^
> > >>>> kernel/sys.c:2501:3: error: call to undeclared function '__khugepaged_enter'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
> > >>      2501 |                 __khugepaged_enter(mm);
> > >>           |                 ^
> > >>     2 errors generated.
> > >
> > > Oops, seems like hugepage_pmd_enabled() and __khugepaged_enter() are only
> > > available when CONFIG_TRANSPARENT_HUGEPAGE is enabled ;)
> > >
> > >>
> > >>
> > >> vim +/hugepage_pmd_enabled +2500 kernel/sys.c
> > >>
> > >>    2471
> > >>    2472    static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
> > >>    2473                                     unsigned long arg4, unsigned long arg5)
> > >>    2474    {
> > >>    2475            struct mm_struct *mm = current->mm;
> > >>    2476
> > >>    2477            if (arg4 || arg5)
> > >>    2478                    return -EINVAL;
> > >>    2479
> > >>    2480            /* Flags are only allowed when disabling. */
> > >>    2481            if ((!thp_disable && flags) || (flags & ~PR_THP_DISABLE_EXCEPT_ADVISED))
> > >>    2482                    return -EINVAL;
> > >>    2483            if (mmap_write_lock_killable(current->mm))
> > >>    2484                    return -EINTR;
> > >>    2485            if (thp_disable) {
> > >>    2486                    if (flags & PR_THP_DISABLE_EXCEPT_ADVISED) {
> > >>    2487                            mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
> > >>    2488                            mm_flags_set(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> > >>    2489                    } else {
> > >>    2490                            mm_flags_set(MMF_DISABLE_THP_COMPLETELY, mm);
> > >>    2491                            mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> > >>    2492                    }
> > >>    2493            } else {
> > >>    2494                    mm_flags_clear(MMF_DISABLE_THP_COMPLETELY, mm);
> > >>    2495                    mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> > >>    2496            }
> > >>    2497
> > >>    2498            if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
> > >>    2499                !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> > >>> 2500                  hugepage_pmd_enabled())
> > >>> 2501                      __khugepaged_enter(mm);
> > >>    2502            mmap_write_unlock(current->mm);
> > >>    2503            return 0;
> > >>    2504    }
> > >>    2505
> > >
> > > So, let's wrap the new logic in an #ifdef CONFIG_TRANSPARENT_HUGEPAGE block.
> > >
> > > diff --git a/kernel/sys.c b/kernel/sys.c
> > > index a1c1e8007f2d..c8600e017933 100644
> > > --- a/kernel/sys.c
> > > +++ b/kernel/sys.c
> > > @@ -2495,10 +2495,13 @@ static int prctl_set_thp_disable(bool thp_disable, unsigned long flags,
> > >                 mm_flags_clear(MMF_DISABLE_THP_EXCEPT_ADVISED, mm);
> > >         }
> > >
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >         if (!mm_flags_test(MMF_DISABLE_THP_COMPLETELY, mm) &&
> > >             !mm_flags_test(MMF_VM_HUGEPAGE, mm) &&
> > >             hugepage_pmd_enabled())
> > >                 __khugepaged_enter(mm);
> > > +#endif
> > > +
> > >         mmap_write_unlock(current->mm);
> > >         return 0;
> > >  }
> >
> > Or in the header file,
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > ...
> > #else
> > static inline bool hugepage_pmd_enabled(void)
> > {
> >       return false;
> > }
> >
> > static inline int __khugepaged_enter(struct mm_struct *mm)
> > {
> >       return 0;
> > }
>
> It seems we have a convention of just not implementing things here unless
> they're necessarily used in core code paths (and _with my suggested change_
> it's _just_ khugepaged that's invoking them).
>
> Anyway with my suggestion we can fix this entirely with:
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>
> void khugepaged_enter_mm(struct mm_struct *mm);
>
> #else
>
> static inline void khugepaged_enter_mm(struct mm_struct *mm)
> {
> }
>
> #endif

ack

Thanks for all the suggestions.

-- 
Regards
Yafang



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-10  2:44 ` [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection Yafang Shao
                     ` (2 preceding siblings ...)
  2025-09-11 14:51   ` Lorenzo Stoakes
@ 2025-09-25 10:05   ` Lance Yang
  2025-09-25 11:38     ` Yafang Shao
  3 siblings, 1 reply; 61+ messages in thread
From: Lance Yang @ 2025-09-25 10:05 UTC (permalink / raw)
  To: Yafang Shao
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc

On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> programs to influence THP order selection based on factors such as:
> - Workload identity
>   For example, workloads running in specific containers or cgroups.
> - Allocation context
>   Whether the allocation occurs during a page fault, khugepaged, swap or
>   other paths.
> - VMA's memory advice settings
>   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> - Memory pressure
>   PSI system data or associated cgroup PSI metrics
>
> The kernel API of this new BPF hook is as follows,
>
> /**
>  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
>  * @vma: vm_area_struct associated with the THP allocation
> >  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set,
>  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
>  *            neither is set.
>  * @tva_type: TVA type for current @vma
>  * @orders: Bitmask of requested THP orders for this allocation
>  *          - PMD-mapped allocation if PMD_ORDER is set
>  *          - mTHP allocation otherwise
>  *
>  * Return: The suggested THP order from the BPF program for allocation. It will
>  *         not exceed the highest requested order in @orders. Return -1 to
>  *         indicate that the original requested @orders should remain unchanged.
>  */
> typedef int thp_order_fn_t(struct vm_area_struct *vma,
>                            enum bpf_thp_vma_type vma_type,
>                            enum tva_type tva_type,
>                            unsigned long orders);
>
> Only a single BPF program can be attached at any given time, though it can
> be dynamically updated to adjust the policy. The implementation supports
> anonymous THP, shmem THP, and mTHP, with future extensions planned for
> file-backed THP.
>
> This functionality is only active when system-wide THP is configured to
> madvise or always mode. It remains disabled in never mode. Additionally,
> if THP is explicitly disabled for a specific task via prctl(), this BPF
> functionality will also be unavailable for that task.
>
> This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> enabled. Note that this capability is currently unstable and may undergo
> significant changes—including potential removal—in future kernel versions.
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

I've tested this patch on my machine, and it works as expected. Using BPF
hooks to control THP is a great step forward!

Tested-by: Lance Yang <lance.yang@linux.dev>

This work also inspires some ideas for another useful hook for THP that I
might propose in the future, once this series is settled and merged ;)

Cheers,
Lance

> ---
>  MAINTAINERS             |   1 +
>  include/linux/huge_mm.h |  26 ++++-
>  mm/Kconfig              |  12 ++
>  mm/Makefile             |   1 +
>  mm/huge_memory_bpf.c    | 243 ++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 280 insertions(+), 3 deletions(-)
>  create mode 100644 mm/huge_memory_bpf.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 8fef05bc2224..d055a3c95300 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16252,6 +16252,7 @@ F:      include/linux/huge_mm.h
>  F:     include/linux/khugepaged.h
>  F:     include/trace/events/huge_memory.h
>  F:     mm/huge_memory.c
> +F:     mm/huge_memory_bpf.c
>  F:     mm/khugepaged.c
>  F:     mm/mm_slot.h
>  F:     tools/testing/selftests/mm/khugepaged.c
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 23f124493c47..f72a5fd04e4f 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
>         TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> +       TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
>  };
>
>  struct kobject;
> @@ -270,6 +271,19 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                          enum tva_type type,
>                                          unsigned long orders);
>
> +#ifdef CONFIG_BPF_GET_THP_ORDER
> +unsigned long
> +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> +                       enum tva_type type, unsigned long orders);
> +#else
> +static inline unsigned long
> +bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
> +                       enum tva_type tva_flags, unsigned long orders)
> +{
> +       return orders;
> +}
> +#endif
> +
>  /**
>   * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
>   * @vma:  the vm area to check
> @@ -291,6 +305,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                                        enum tva_type type,
>                                        unsigned long orders)
>  {
> +       unsigned long bpf_orders;
> +
> +       bpf_orders = bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
> +       if (!bpf_orders)
> +               return 0;
> +
>         /*
>          * Optimization to check if required orders are enabled early. Only
>          * forced collapse ignores sysfs configs.
> @@ -304,12 +324,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
>                     ((vm_flags & VM_HUGEPAGE) && hugepage_global_enabled()))
>                         mask |= READ_ONCE(huge_anon_orders_inherit);
>
> -               orders &= mask;
> -               if (!orders)
> +               bpf_orders &= mask;
> +               if (!bpf_orders)
>                         return 0;
>         }
>
> -       return __thp_vma_allowable_orders(vma, vm_flags, type, orders);
> +       return __thp_vma_allowable_orders(vma, vm_flags, type, bpf_orders);
>  }
>
>  struct thpsize {
> diff --git a/mm/Kconfig b/mm/Kconfig
> index d1ed839ca710..4d89d2158f10 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -896,6 +896,18 @@ config NO_PAGE_MAPCOUNT
>
>           EXPERIMENTAL because the impact of some changes is still unclear.
>
> +config BPF_GET_THP_ORDER
> +       bool "BPF-based THP order selection (EXPERIMENTAL)"
> +       depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
> +
> +       help
> +         Enable dynamic THP order selection using BPF programs. This
> +         experimental feature allows custom BPF logic to determine optimal
> +         transparent hugepage allocation sizes at runtime.
> +
> +         WARNING: This feature is unstable and may change in future kernel
> +         versions.
> +
>  endif # TRANSPARENT_HUGEPAGE
>
>  # simple helper to make the code a bit easier to read
> diff --git a/mm/Makefile b/mm/Makefile
> index 21abb3353550..f180332f2ad0 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
>  obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> +obj-$(CONFIG_BPF_GET_THP_ORDER) += huge_memory_bpf.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
>  obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
> new file mode 100644
> index 000000000000..525ee22ab598
> --- /dev/null
> +++ b/mm/huge_memory_bpf.c
> @@ -0,0 +1,243 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * BPF-based THP policy management
> + *
> + * Author: Yafang Shao <laoar.shao@gmail.com>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +#include <linux/huge_mm.h>
> +#include <linux/khugepaged.h>
> +
> +enum bpf_thp_vma_type {
> +       BPF_THP_VM_NONE = 0,
> +       BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
> +       BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
> +};
> +
> +/**
> + * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> + * @vma: vm_area_struct associated with the THP allocation
> + * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set,
> + *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> + *            neither is set.
> + * @tva_type: TVA type for current @vma
> + * @orders: Bitmask of requested THP orders for this allocation
> + *          - PMD-mapped allocation if PMD_ORDER is set
> + *          - mTHP allocation otherwise
> + *
> + * Return: The suggested THP order from the BPF program for allocation. It will
> + *         not exceed the highest requested order in @orders. Return -1 to
> + *         indicate that the original requested @orders should remain unchanged.
> + */
> +typedef int thp_order_fn_t(struct vm_area_struct *vma,
> +                          enum bpf_thp_vma_type vma_type,
> +                          enum tva_type tva_type,
> +                          unsigned long orders);
> +
> +struct bpf_thp_ops {
> +       thp_order_fn_t __rcu *thp_get_order;
> +};
> +
> +static struct bpf_thp_ops bpf_thp;
> +static DEFINE_SPINLOCK(thp_ops_lock);
> +
> +/*
> + * Returns the original @orders if no BPF program is attached or if the
> + * suggested order is invalid.
> + */
> +unsigned long bpf_hook_thp_get_orders(struct vm_area_struct *vma,
> +                                     vm_flags_t vma_flags,
> +                                     enum tva_type tva_type,
> +                                     unsigned long orders)
> +{
> +       thp_order_fn_t *bpf_hook_thp_get_order;
> +       unsigned long thp_orders = orders;
> +       enum bpf_thp_vma_type vma_type;
> +       int thp_order;
> +
> +       /* No BPF program is attached */
> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +                     &transparent_hugepage_flags))
> +               return orders;
> +
> +       if (vma_flags & VM_HUGEPAGE)
> +               vma_type = BPF_THP_VM_HUGEPAGE;
> +       else if (vma_flags & VM_NOHUGEPAGE)
> +               vma_type = BPF_THP_VM_NOHUGEPAGE;
> +       else
> +               vma_type = BPF_THP_VM_NONE;
> +
> +       rcu_read_lock();
> +       bpf_hook_thp_get_order = rcu_dereference(bpf_thp.thp_get_order);
> +       if (!bpf_hook_thp_get_order)
> +               goto out;
> +
> +       thp_order = bpf_hook_thp_get_order(vma, vma_type, tva_type, orders);
> +       if (thp_order < 0)
> +               goto out;
> +       /*
> +        * The maximum requested order is determined by the callsite. E.g.:
> +        * - PMD-mapped THP uses PMD_ORDER
> +        * - mTHP uses (PMD_ORDER - 1)
> +        *
> +        * We must respect this upper bound to avoid undefined behavior. So the
> +        * highest suggested order can't exceed the highest requested order.
> +        */
> +       if (thp_order <= highest_order(orders))
> +               thp_orders = BIT(thp_order);
> +
> +out:
> +       rcu_read_unlock();
> +       return thp_orders;
> +}
> +
> +static bool bpf_thp_ops_is_valid_access(int off, int size,
> +                                       enum bpf_access_type type,
> +                                       const struct bpf_prog *prog,
> +                                       struct bpf_insn_access_aux *info)
> +{
> +       return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_func_proto *
> +bpf_thp_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +       return bpf_base_func_proto(func_id, prog);
> +}
> +
> +static const struct bpf_verifier_ops thp_bpf_verifier_ops = {
> +       .get_func_proto = bpf_thp_get_func_proto,
> +       .is_valid_access = bpf_thp_ops_is_valid_access,
> +};
> +
> +static int bpf_thp_init(struct btf *btf)
> +{
> +       return 0;
> +}
> +
> +static int bpf_thp_check_member(const struct btf_type *t,
> +                               const struct btf_member *member,
> +                               const struct bpf_prog *prog)
> +{
> +       /* The call site operates under RCU protection. */
> +       if (prog->sleepable)
> +               return -EINVAL;
> +       return 0;
> +}
> +
> +static int bpf_thp_init_member(const struct btf_type *t,
> +                              const struct btf_member *member,
> +                              void *kdata, const void *udata)
> +{
> +       return 0;
> +}
> +
> +static int bpf_thp_reg(void *kdata, struct bpf_link *link)
> +{
> +       struct bpf_thp_ops *ops = kdata;
> +
> +       spin_lock(&thp_ops_lock);
> +       if (test_and_set_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +                            &transparent_hugepage_flags)) {
> +               spin_unlock(&thp_ops_lock);
> +               return -EBUSY;
> +       }
> +       WARN_ON_ONCE(rcu_access_pointer(bpf_thp.thp_get_order));
> +       rcu_assign_pointer(bpf_thp.thp_get_order, ops->thp_get_order);
> +       spin_unlock(&thp_ops_lock);
> +       return 0;
> +}
> +
> +static void bpf_thp_unreg(void *kdata, struct bpf_link *link)
> +{
> +       thp_order_fn_t *old_fn;
> +
> +       spin_lock(&thp_ops_lock);
> +       clear_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED, &transparent_hugepage_flags);
> +       old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, NULL,
> +                                    lockdep_is_held(&thp_ops_lock));
> +       WARN_ON_ONCE(!old_fn);
> +       spin_unlock(&thp_ops_lock);
> +
> +       synchronize_rcu();
> +}
> +
> +static int bpf_thp_update(void *kdata, void *old_kdata, struct bpf_link *link)
> +{
> +       thp_order_fn_t *old_fn, *new_fn;
> +       struct bpf_thp_ops *old = old_kdata;
> +       struct bpf_thp_ops *ops = kdata;
> +       int ret = 0;
> +
> +       if (!ops || !old)
> +               return -EINVAL;
> +
> +       spin_lock(&thp_ops_lock);
> +       /* The prog has already been removed. */
> +       if (!test_bit(TRANSPARENT_HUGEPAGE_BPF_ATTACHED,
> +                     &transparent_hugepage_flags)) {
> +               ret = -ENOENT;
> +               goto out;
> +       }
> +
> +       new_fn = rcu_dereference(ops->thp_get_order);
> +       old_fn = rcu_replace_pointer(bpf_thp.thp_get_order, new_fn,
> +                                    lockdep_is_held(&thp_ops_lock));
> +       WARN_ON_ONCE(!old_fn || !new_fn);
> +
> +out:
> +       spin_unlock(&thp_ops_lock);
> +       if (!ret)
> +               synchronize_rcu();
> +       return ret;
> +}
> +
> +static int bpf_thp_validate(void *kdata)
> +{
> +       struct bpf_thp_ops *ops = kdata;
> +
> +       if (!ops->thp_get_order) {
> +               pr_err("bpf_thp: required op isn't implemented\n");
> +               return -EINVAL;
> +       }
> +       return 0;
> +}
> +
> +static int bpf_thp_get_order(struct vm_area_struct *vma,
> +                            enum bpf_thp_vma_type vma_type,
> +                            enum tva_type tva_type,
> +                            unsigned long orders)
> +{
> +       return -1;
> +}
> +
> +static struct bpf_thp_ops __bpf_thp_ops = {
> +       .thp_get_order = (thp_order_fn_t __rcu *)bpf_thp_get_order,
> +};
> +
> +static struct bpf_struct_ops bpf_bpf_thp_ops = {
> +       .verifier_ops = &thp_bpf_verifier_ops,
> +       .init = bpf_thp_init,
> +       .check_member = bpf_thp_check_member,
> +       .init_member = bpf_thp_init_member,
> +       .reg = bpf_thp_reg,
> +       .unreg = bpf_thp_unreg,
> +       .update = bpf_thp_update,
> +       .validate = bpf_thp_validate,
> +       .cfi_stubs = &__bpf_thp_ops,
> +       .owner = THIS_MODULE,
> +       .name = "bpf_thp_ops",
> +};
> +
> +static int __init bpf_thp_ops_init(void)
> +{
> +       int err;
> +
> +       err = register_bpf_struct_ops(&bpf_bpf_thp_ops, bpf_thp_ops);
> +       if (err)
> +               pr_err("bpf_thp: Failed to register struct_ops (%d)\n", err);
> +       return err;
> +}
> +late_initcall(bpf_thp_ops_init);
> --
> 2.47.3
>
>



* Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection
  2025-09-25 10:05   ` Lance Yang
@ 2025-09-25 11:38     ` Yafang Shao
  0 siblings, 0 replies; 61+ messages in thread
From: Yafang Shao @ 2025-09-25 11:38 UTC (permalink / raw)
  To: Lance Yang
  Cc: akpm, david, ziy, baolin.wang, lorenzo.stoakes, Liam.Howlett,
	npache, ryan.roberts, dev.jain, hannes, usamaarif642,
	gutierrez.asier, willy, ast, daniel, andrii, ameryhung, rientjes,
	corbet, 21cnbao, shakeel.butt, bpf, linux-mm, linux-doc

On Thu, Sep 25, 2025 at 6:06 PM Lance Yang <lance.yang@linux.dev> wrote:
>
> On Wed, Sep 10, 2025 at 10:53 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
> > THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
> > programs to influence THP order selection based on factors such as:
> > - Workload identity
> >   For example, workloads running in specific containers or cgroups.
> > - Allocation context
> >   Whether the allocation occurs during a page fault, khugepaged, swap or
> >   other paths.
> > - VMA's memory advice settings
> >   MADV_HUGEPAGE or MADV_NOHUGEPAGE
> > - Memory pressure
> >   PSI system data or associated cgroup PSI metrics
> >
> > The kernel API of this new BPF hook is as follows,
> >
> > /**
> >  * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
> >  * @vma: vm_area_struct associated with the THP allocation
> >  * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set,
> >  *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
> >  *            neither is set.
> >  * @tva_type: TVA type for current @vma
> >  * @orders: Bitmask of requested THP orders for this allocation
> >  *          - PMD-mapped allocation if PMD_ORDER is set
> >  *          - mTHP allocation otherwise
> >  *
> >  * Return: The suggested THP order from the BPF program for allocation. It will
> >  *         not exceed the highest requested order in @orders. Return -1 to
> >  *         indicate that the original requested @orders should remain unchanged.
> >  */
> > typedef int thp_order_fn_t(struct vm_area_struct *vma,
> >                            enum bpf_thp_vma_type vma_type,
> >                            enum tva_type tva_type,
> >                            unsigned long orders);
> >
> > Only a single BPF program can be attached at any given time, though it can
> > be dynamically updated to adjust the policy. The implementation supports
> > anonymous THP, shmem THP, and mTHP, with future extensions planned for
> > file-backed THP.
> >
> > This functionality is only active when system-wide THP is configured to
> > madvise or always mode. It remains disabled in never mode. Additionally,
> > if THP is explicitly disabled for a specific task via prctl(), this BPF
> > functionality will also be unavailable for that task.
> >
> > This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
> > enabled. Note that this capability is currently unstable and may undergo
> > significant changes—including potential removal—in future kernel versions.
> >
> > Suggested-by: David Hildenbrand <david@redhat.com>
> > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
>
> I've tested this patch on my machine, and it works as expected. Using BPF
> hooks to control THP is a great step forward!
>
> Tested-by: Lance Yang <lance.yang@linux.dev>

Thanks for your test. I will post a new version ASAP.

>
> This work also inspires some ideas for another useful hook for THP that I
> might propose in the future, once this series is settled and merged ;)

Excited to see it!

-- 
Regards
Yafang



end of thread, other threads:[~2025-09-25 11:38 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-10  2:44 [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Yafang Shao
2025-09-10  2:44 ` [PATCH v7 mm-new 01/10] mm: thp: remove disabled task from khugepaged_mm_slot Yafang Shao
2025-09-10  5:11   ` Lance Yang
2025-09-10  6:17     ` Yafang Shao
2025-09-10  7:21   ` Lance Yang
2025-09-10 17:27   ` kernel test robot
2025-09-11  2:12     ` Lance Yang
2025-09-11  2:28       ` Zi Yan
2025-09-11  2:35         ` Yafang Shao
2025-09-11  2:38         ` Lance Yang
2025-09-11 13:47         ` Lorenzo Stoakes
2025-09-14  2:48           ` Yafang Shao
2025-09-11 13:43   ` Lorenzo Stoakes
2025-09-14  2:47     ` Yafang Shao
2025-09-10  2:44 ` [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection Yafang Shao
2025-09-10 12:42   ` Lance Yang
2025-09-10 12:54     ` Lance Yang
2025-09-10 13:56       ` Lance Yang
2025-09-11  2:48         ` Yafang Shao
2025-09-11  3:04           ` Lance Yang
2025-09-11 14:45         ` Lorenzo Stoakes
2025-09-11 14:02     ` Lorenzo Stoakes
2025-09-11 14:42       ` Lance Yang
2025-09-11 14:58         ` Lorenzo Stoakes
2025-09-12  7:58           ` Yafang Shao
2025-09-12 12:04             ` Lorenzo Stoakes
2025-09-11 14:33   ` Lorenzo Stoakes
2025-09-12  8:28     ` Yafang Shao
2025-09-12 11:53       ` Lorenzo Stoakes
2025-09-14  2:22         ` Yafang Shao
2025-09-11 14:51   ` Lorenzo Stoakes
2025-09-12  8:03     ` Yafang Shao
2025-09-12 12:00       ` Lorenzo Stoakes
2025-09-25 10:05   ` Lance Yang
2025-09-25 11:38     ` Yafang Shao
2025-09-10  2:44 ` [PATCH v7 mm-new 03/10] mm: thp: decouple THP allocation between swap and page fault paths Yafang Shao
2025-09-11 14:55   ` Lorenzo Stoakes
2025-09-12  7:20     ` Yafang Shao
2025-09-12 12:04       ` Lorenzo Stoakes
2025-09-10  2:44 ` [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged Yafang Shao
2025-09-11 15:53   ` Lance Yang
2025-09-12  6:21     ` Yafang Shao
2025-09-11 15:58   ` Lorenzo Stoakes
2025-09-12  6:17     ` Yafang Shao
2025-09-12 13:48       ` Lorenzo Stoakes
2025-09-14  2:19         ` Yafang Shao
2025-09-10  2:44 ` [PATCH v7 mm-new 05/10] bpf: mark mm->owner as __safe_rcu_or_null Yafang Shao
2025-09-11 16:04   ` Lorenzo Stoakes
2025-09-10  2:44 ` [PATCH v7 mm-new 06/10] bpf: mark vma->vm_mm as __safe_trusted_or_null Yafang Shao
2025-09-11 17:08   ` Lorenzo Stoakes
2025-09-11 17:30   ` Liam R. Howlett
2025-09-11 17:44     ` Lorenzo Stoakes
2025-09-12  3:56       ` Yafang Shao
2025-09-12  3:50     ` Yafang Shao
2025-09-10  2:44 ` [PATCH v7 mm-new 07/10] selftests/bpf: add a simple BPF based THP policy Yafang Shao
2025-09-10 20:44   ` Alexei Starovoitov
2025-09-11  2:31     ` Yafang Shao
2025-09-10  2:44 ` [PATCH v7 mm-new 08/10] selftests/bpf: add test case to update " Yafang Shao
2025-09-10  2:44 ` [PATCH v7 mm-new 09/10] selftests/bpf: add test cases for invalid thp_adjust usage Yafang Shao
2025-09-10  2:44 ` [PATCH v7 mm-new 10/10] Documentation: add BPF-based THP policy management Yafang Shao
2025-09-10 11:11 ` [PATCH v7 mm-new 0/9] mm, bpf: BPF based THP order selection Lance Yang
